I have the following 3 tables, and I don't have the option to modify them:
- [Logs] table with [log_type], [log_time], [log_data], and other irrelevant columns
- [Places] table with [name], [place_index], and other irrelevant columns
- [Addresses] table with [place_index], [road_0], ..., [road_15], and other irrelevant columns
Logs of [log_type] 12 or 15 have one of the roads from the [Addresses] table in [log_data]; logs of [log_type] 1 have one of the names from the [Places] table; logs of [log_type] 7 are similar to type 1, but with a space and some other info after the name.
The query has to select the "12" or "15" logs which either
- do not have a "1" log before them whose place ([log_data]) contains their road, or
- have such a "1" log before them, but with a "7" log for the same place in between.
For example, assuming Road_X is in PlaceI, Road_Y is in PlaceJ, and Road_Ex and Road_Z are in PlaceK, and given the following table:
-------------------------------------------
| id | log_type | log_time   | log_data   |
-------------------------------------------
| 1  | 7        | 2018-01-30 | PlaceA 2.3 |
-------------------------------------------
| 2  | 1        | 2018-01-30 | PlaceB     |
-------------------------------------------
| 3  | 12       | 2018-01-31 | Road_Ex    |
-------------------------------------------
| 4  | 7        | 2018-02-01 | PlaceB 2.5 |
-------------------------------------------
| 5  | 7        | 2018-02-01 | PlaceI 2.7 |
-------------------------------------------
| 6  | 15       | 2018-02-02 | Road_X     |
-------------------------------------------
| 7  | 1        | 2018-02-04 | PlaceI     |
-------------------------------------------
| 8  | 1        | 2018-02-05 | PlaceJ     |
-------------------------------------------
| 9  | 12       | 2018-02-06 | Road_X     |
-------------------------------------------
| 10 | 15       | 2018-02-06 | Road_Y     |
-------------------------------------------
| 11 | 1        | 2018-02-10 | PlaceK     |
-------------------------------------------
| 12 | 12       | 2018-02-11 | Road_Z     |
-------------------------------------------
| 13 | 12       | 2018-02-11 | Road_Ex    |
-------------------------------------------
| 14 | 7        | 2018-02-12 | PlaceI 2.7 |
-------------------------------------------
| 15 | 7        | 2018-02-12 | PlaceJ 2.8 |
-------------------------------------------
| 16 | 12       | 2018-02-17 | Road_X     |
-------------------------------------------
| 17 | 15       | 2018-02-18 | Road_Y     |
-------------------------------------------
The query should return rows 3, 6, 16, and 17. (Rows 9, 10, 12, and 13 are excluded because a matching "1" log precedes them with no matching "7" log in between; rows 3 and 6 have no preceding "1" log for their place, while rows 16 and 17 have a "7" log between the "1" log and themselves.)
This is the best I could come up with so far:
declare @since datetimeoffset
set @since = '2018-02-10-05:00'
select a.[log_type], a.[log_time], a.[log_data]
from
(select [log_type], [log_data], [log_time],
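-- look up the place name for the road stored in this row's log_data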
(select [name]
from [Places]
where [place_index] =
(select [place_index]
from [Addresses]
where [log_data] in
([road_0], [road_1], [road_2], [road_3], [road_4],
[road_5], [road_6], [road_7], [road_8], [road_9],
[road_10], [road_11], [road_12], [road_13], [road_14],
[road_15]))) as [place_name]
from [Logs]
where [log_type] in (12, 15) and [log_time] >= @since) a
outer apply
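-- the more recent of the latest matching "1" log and the latest matching "7" log before this row;
-- [log_data] stays NULL when the "7" log is the more recent one, or when neither exists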
(select TOP 1 [log_data] from
(select TOP 1 [log_data], [log_time]
from [Logs]
where [log_type] = 1 and [log_data] = [place_name]
and [log_time] >= @since and [log_time] < a.[log_time]
order by [log_time] desc
UNION ALL
(select TOP 1 NULL, [log_time]
from [Logs]
where [log_type] = 7 and LEFT([log_data], CHARINDEX(' ', [log_data]) - 1) = [place_name]
and [log_time] >= @since and [log_time] < a.[log_time]
order by [log_time] desc)) i
order by [log_time] desc) b
where b.[log_data] is null
order by a.[log_time] desc
Note that because I need to reference [place_name] from table a in the subquery, I had to use outer apply instead of left join. However, it takes 5 seconds to generate fewer than 40 records from around 30k records in [Logs]. I've also tried replacing the outer apply with a where clause, but there's no notable difference.
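For reference, the where-based variant expresses the same check with not exists, roughly like this (a sketch along the lines of what I tried, not the exact query):
-- Sketch: keep a "12"/"15" log unless a matching "1" log precedes it
-- with no matching "7" log for the same place in between.
select a.[log_type], a.[log_time], a.[log_data]
from
  (select [log_type], [log_data], [log_time],
     (select [name]
      from [Places]
      where [place_index] =
        (select [place_index]
         from [Addresses]
         where [log_data] in
           ([road_0], [road_1], [road_2], [road_3], [road_4],
            [road_5], [road_6], [road_7], [road_8], [road_9],
            [road_10], [road_11], [road_12], [road_13], [road_14],
            [road_15]))) as [place_name]
   from [Logs]
   where [log_type] in (12, 15) and [log_time] >= @since) a
where not exists
  (select 1
   from [Logs] one
   where one.[log_type] = 1 and one.[log_data] = a.[place_name]
     and one.[log_time] >= @since and one.[log_time] < a.[log_time]
     and not exists
       (select 1
        from [Logs] seven
        where seven.[log_type] = 7
          and LEFT(seven.[log_data], CHARINDEX(' ', seven.[log_data]) - 1) = a.[place_name]
          and seven.[log_time] > one.[log_time] and seven.[log_time] < a.[log_time]))
order by a.[log_time] desc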
Can the query be further optimized or will I need to move this business logic to the service that sends this query?
EDIT
DDL for the tables (only including relevant columns):
CREATE TABLE [Logs](
[log_time] [datetimeoffset](7) NOT NULL, -- indexed
[log_type] [int] NULL,
[log_data] [varchar](300) NULL
)
CREATE TABLE [Places](
[place_index] [smallint] NOT NULL, -- PK, indexed
[name] [nchar](253) NULL
)
CREATE TABLE [Addresses](
[place_index] [smallint] NOT NULL, -- PK, indexed
[road_0] [char](32) NULL,
[road_1] [char](32) NULL,
[road_2] [char](32) NULL,
[road_3] [char](32) NULL,
[road_4] [char](32) NULL,
[road_5] [char](32) NULL,
[road_6] [char](32) NULL,
[road_7] [char](32) NULL,
[road_8] [char](32) NULL,
[road_9] [char](32) NULL,
[road_10] [char](32) NULL,
[road_11] [char](32) NULL,
[road_12] [char](32) NULL,
[road_13] [char](32) NULL,
[road_14] [char](32) NULL,
[road_15] [char](32) NULL
)
- Can you post DDL for the tables? That would save a tremendous amount of time reviewing the query. – Wes H, Feb 21, 2018 at 14:09
- @WesH Thanks for the advice. I've included DDL for the 3 tables. – Thomas Hsieh, Feb 22, 2018 at 3:35
- Can you also include some data insert scripts? That will ensure my results are correct and I also really don't want to type your data in manually. We'll just need enough to validate your question. – Wes H, Feb 22, 2018 at 14:05
- You didn't include ID in your DDL. What table is that part of? – Wes H, Feb 22, 2018 at 14:31
1 Answer
Here is an attempt, though I can't verify it works because you didn't provide insert scripts and didn't include the ID column in the DDL. You also don't mention whether a log 1/7 is only considered for the NEXT log 12/15 for the same place, or whether it applies to ALL logs 12/15 after the log 1/7 for that place. This script assumes all.
DECLARE @since DATETIMEOFFSET
;
SET @since = '2018-02-10-05:00'
;
WITH PrevLogs AS
(
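/* type-1 logs, with any later type-7 log for the same place attached (Seven columns stay NULL when none exists) */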
SELECT p.place_index, /* return the place index for log one */
One.log_time,
Seven.log_type
FROM dbo.Logs AS One
JOIN dbo.Places AS p
ON One.log_data = p.name
LEFT JOIN dbo.Logs AS Seven
ON Seven.log_data LIKE One.log_data + '%' /* Log data One matches the first portion of Log data Seven */
AND Seven.log_type = 7 /* Only match log Seven to the log One */
AND One.log_time < Seven.log_time /* Seven log has to be after the One log */
WHERE One.log_type = 1 /* Only return logs of type 1 */
),
LogsIWant AS
(
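/* candidate type-12/15 logs since @since, with the place index of the road in log_data */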
SELECT a.place_index, /* return the place index of the roads in log 12/15 */
l.log_time,
l.log_type,
l.log_data
FROM dbo.Logs AS l
JOIN dbo.Addresses AS a
ON l.log_data IN ( a.road_0, a.road_1, a.road_2, a.road_3, a.road_4, a.road_5, a.road_6, a.road_7, a.road_8, a.road_9, a.road_10, a.road_11, a.road_12, a.road_13, a.road_14, a.road_15 )
WHERE l.log_type IN ( 12, 15 )
AND l.log_time >= @since
)
SELECT liw.place_index,
liw.log_time,
liw.log_type,
liw.log_data,
pl.place_index,
pl.log_type
FROM LogsIWant AS liw
LEFT JOIN PrevLogs AS pl
ON liw.place_index = pl.place_index
AND pl.log_time < liw.log_time /* Log 1/7 are before Log 12/15 */
WHERE pl.place_index IS NULL /* Log 1 does not exist */
OR pl.log_type IS NOT NULL /* Log 1 & Log 7 exist */
;
To your question of whether this can be further optimized, I would answer: not really. This structure violates so many principles of data quality that it would take less time to explain what it does do right.
Twice in my IT career I have chosen to change jobs rather than put up with the mandate that "change is not an option". If I were handed a system structured like this and told I could not make changes, I would do the best I could while looking for another job.
Sorry to be blunt, but the real problem here is a horrible data model.
I don't know if adding an index would count as a change, but a simple improvement you could make would be a non-clustered index on the name column of the Places table. That would at least improve the join performance from the Logs table.
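A minimal sketch of that index (the index name here is just an example):
CREATE NONCLUSTERED INDEX IX_Places_name  -- hypothetical name; supports the join on Places.name
    ON dbo.Places ([name]);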
Good luck
- Thanks for the query and your honest advice! "...a horrible data model." I can't agree more... We have bad designs all over the place with no time given to re-design. Adding an index should be possible, but I believe the bottleneck is the outer apply (or where) subquery. The main query runs within a second for more than 100k records, which is good enough for my purpose. I will try out your query and let you know. – Thomas Hsieh, Feb 23, 2018 at 12:45
- Forgot to add, [place_index] is the ID of [Places] and [Addresses], and [Logs] doesn't have an ID... – Thomas Hsieh, Feb 23, 2018 at 12:54
- I've tried your solution and even added an index on log_data, but the result is the same. I actually have 6 versions, all of which perform almost, if not exactly, the same, with or without the index... I think you pointed in the right direction, though. If I take out the requirement of checking log_type 7, then LEFT JOIN with an index on log_data outperforms the other versions. Thanks for the effort regardless! – Thomas Hsieh, Feb 26, 2018 at 2:55