I have the following 3 tables, and I don't have the option to modify them:
- [Logs] table with [log_type], [log_time], [log_data], and other irrelevant columns
- [Places] table with [name], [place_index], and other irrelevant columns
- [Addresses] table with [place_index], [road_0], ..., [road_15], and other irrelevant columns
Logs of [log_type] 12 or 15 have one of the roads from the [Addresses] table in [log_data]; logs of [log_type] 1 have one of the names from the [Places] table; logs of [log_type] 7 are similar to type 1, but with a space and some other info after the name.
The query has to select the "12" or "15" logs which either
- do not have a "1" log before them whose place ([log_data]) contains their road, or
- have such a "1" log before them, but with a "7" log for the same place in between.
For example, assuming Road_X is in PlaceI, Road_Y is in PlaceJ, and Road_Ex and Road_Z are in PlaceK, and given the following table:
-------------------------------------------
| id | log_type | log_time   | log_data   |
-------------------------------------------
| 1  | 7        | 2018-01-30 | PlaceA 2.3 |
-------------------------------------------
| 2  | 1        | 2018-01-30 | PlaceB     |
-------------------------------------------
| 3  | 12       | 2018-01-31 | Road_Ex    |
-------------------------------------------
| 4  | 7        | 2018-02-01 | PlaceB 2.5 |
-------------------------------------------
| 5  | 7        | 2018-02-01 | PlaceI 2.7 |
-------------------------------------------
| 6  | 15       | 2018-02-02 | Road_X     |
-------------------------------------------
| 7  | 1        | 2018-02-04 | PlaceI     |
-------------------------------------------
| 8  | 1        | 2018-02-05 | PlaceJ     |
-------------------------------------------
| 9  | 12       | 2018-02-06 | Road_X     |
-------------------------------------------
| 10 | 15       | 2018-02-06 | Road_Y     |
-------------------------------------------
| 11 | 1        | 2018-02-10 | PlaceK     |
-------------------------------------------
| 12 | 12       | 2018-02-11 | Road_Z     |
-------------------------------------------
| 13 | 12       | 2018-02-11 | Road_Ex    |
-------------------------------------------
| 14 | 7        | 2018-02-12 | PlaceI 2.7 |
-------------------------------------------
| 15 | 7        | 2018-02-12 | PlaceJ 2.8 |
-------------------------------------------
| 16 | 12       | 2018-02-17 | Road_X     |
-------------------------------------------
| 17 | 15       | 2018-02-18 | Road_Y     |
-------------------------------------------
The query should return rows 3, 6, 16, and 17. (Rows 9, 10, 12, and 13 are excluded because a matching "1" log precedes them with no matching "7" log in between; rows 3 and 6 have no preceding "1" log for their place, while rows 16 and 17 have a "7" log between the "1" log and themselves.)
This is the best I could come up with so far:
declare @since datetimeoffset
set @since = '2018-02-10-05:00'
select a.[log_type], a.[log_time], a.[log_data]
from
(select [log_type], [log_data], [log_time],
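-- look up the place name for the road stored in this row's log_data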
(select [name]
from [Places]
where [place_index] =
(select [place_index]
from [Addresses]
where [log_data] in
([road_0], [road_1], [road_2], [road_3], [road_4],
[road_5], [road_6], [road_7], [road_8], [road_9],
[road_10], [road_11], [road_12], [road_13], [road_14],
[road_15]))) as [place_name]
from [Logs]
where [log_type] in (12, 15) and [log_time] >= @since) a
outer apply
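-- the more recent of the latest matching "1" log and the latest matching "7" log before this row;
-- [log_data] stays NULL when the "7" log is the more recent one, or when neither exists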
(select TOP 1 [log_data] from
(select TOP 1 [log_data], [log_time]
from [Logs]
where [log_type] = 1 and [log_data] = [place_name]
and [log_time] >= @since and [log_time] < a.[log_time]
order by [log_time] desc
UNION ALL
(select TOP 1 NULL, [log_time]
from [Logs]
where [log_type] = 7 and LEFT([log_data], CHARINDEX(' ', [log_data]) - 1) = [place_name]
and [log_time] >= @since and [log_time] < a.[log_time]
order by [log_time] desc)) i
order by [log_time] desc) b
where b.[log_data] is null
order by a.[log_time] desc
Note that because I need to reference [place_name] from table a in the subquery, I had to use outer apply instead of left join. However, it takes 5 seconds to generate fewer than 40 records from around 30k records in [Logs]. I've also tried replacing the outer apply with a where clause, but there's no notable difference.
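For reference, the where-based variant expresses the same check with not exists, roughly like this (a sketch along the lines of what I tried, not the exact query):
-- Sketch: keep a "12"/"15" log unless a matching "1" log precedes it
-- with no matching "7" log for the same place in between.
select a.[log_type], a.[log_time], a.[log_data]
from
  (select [log_type], [log_data], [log_time],
     (select [name]
      from [Places]
      where [place_index] =
        (select [place_index]
         from [Addresses]
         where [log_data] in
           ([road_0], [road_1], [road_2], [road_3], [road_4],
            [road_5], [road_6], [road_7], [road_8], [road_9],
            [road_10], [road_11], [road_12], [road_13], [road_14],
            [road_15]))) as [place_name]
   from [Logs]
   where [log_type] in (12, 15) and [log_time] >= @since) a
where not exists
  (select 1
   from [Logs] one
   where one.[log_type] = 1 and one.[log_data] = a.[place_name]
     and one.[log_time] >= @since and one.[log_time] < a.[log_time]
     and not exists
       (select 1
        from [Logs] seven
        where seven.[log_type] = 7
          and LEFT(seven.[log_data], CHARINDEX(' ', seven.[log_data]) - 1) = a.[place_name]
          and seven.[log_time] > one.[log_time] and seven.[log_time] < a.[log_time]))
order by a.[log_time] desc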
Can the query be further optimized or will I need to move this business logic to the service that sends this query?
EDIT
DDL for the tables (only including relevant columns):
CREATE TABLE [Logs](
[log_time] [datetimeoffset](7) NOT NULL, -- indexed
[log_type] [int] NULL,
[log_data] [varchar](300) NULL
)
CREATE TABLE [Places](
[place_index] [smallint] NOT NULL, -- PK, indexed
[name] [nchar](253) NULL
)
CREATE TABLE [Addresses](
[place_index] [smallint] NOT NULL, -- PK, indexed
[road_0] [char](32) NULL,
[road_1] [char](32) NULL,
[road_2] [char](32) NULL,
[road_3] [char](32) NULL,
[road_4] [char](32) NULL,
[road_5] [char](32) NULL,
[road_6] [char](32) NULL,
[road_7] [char](32) NULL,
[road_8] [char](32) NULL,
[road_9] [char](32) NULL,
[road_10] [char](32) NULL,
[road_11] [char](32) NULL,
[road_12] [char](32) NULL,
[road_13] [char](32) NULL,
[road_14] [char](32) NULL,
[road_15] [char](32) NULL
)
- Can you post DDL for the tables? That would save a tremendous amount of time reviewing the query. – Wes H, Feb 21, 2018 at 14:09
- @WesH Thanks for the advice. I've included DDL for the 3 tables. – Thomas Hsieh, Feb 22, 2018 at 3:35
- Can you also include some data insert scripts? That will ensure my results are correct and I also really don't want to type your data in manually. We'll just need enough to validate your question. – Wes H, Feb 22, 2018 at 14:05
- You didn't include ID in your DDL. What table is that part of? – Wes H, Feb 22, 2018 at 14:31
1 Answer
Here is an attempt, though I can't verify it works because you didn't provide insert scripts and didn't include the ID column in the DDL. You also don't mention whether a log 1/7 is only considered for the NEXT log 12/15 for the same place, or whether it applies to ALL logs 12/15 after the log 1/7 for that place. This script assumes all.
DECLARE @since DATETIMEOFFSET
;
SET @since = '2018-02-10-05:00'
;
WITH PrevLogs AS
(
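/* type-1 logs, with any later type-7 log for the same place attached (Seven columns stay NULL when none exists) */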
SELECT p.place_index, /* return the place index for log one */
One.log_time,
Seven.log_type
FROM dbo.Logs AS One
JOIN dbo.Places AS p
ON One.log_data = p.name
LEFT JOIN dbo.Logs AS Seven
ON Seven.log_data LIKE One.log_data + '%' /* Log data One matches the first portion of Log data Seven */
AND Seven.log_type = 7 /* Only match log Seven to the log One */
AND One.log_time < Seven.log_time /* Seven log has to be after the One log */
WHERE One.log_type = 1 /* Only return logs of type 1 */
),
LogsIWant AS
(
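/* candidate type-12/15 logs since @since, with the place index of the road in log_data */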
SELECT a.place_index, /* return the place index of the roads in log 12/15 */
l.log_time,
l.log_type,
l.log_data
FROM dbo.Logs AS l
JOIN dbo.Addresses AS a
ON l.log_data IN ( a.road_0, a.road_1, a.road_2, a.road_3, a.road_4, a.road_5, a.road_6, a.road_7, a.road_8, a.road_9, a.road_10, a.road_11, a.road_12, a.road_13, a.road_14, a.road_15 )
WHERE l.log_type IN ( 12, 15 )
AND l.log_time >= @since
)
SELECT liw.place_index,
liw.log_time,
liw.log_type,
liw.log_data,
pl.place_index,
pl.log_type
FROM LogsIWant AS liw
LEFT JOIN PrevLogs AS pl
ON liw.place_index = pl.place_index
AND pl.log_time < liw.log_time /* Log 1/7 are before Log 12/15 */
WHERE pl.place_index IS NULL /* Log 1 does not exist */
OR pl.log_type IS NOT NULL /* Log 1 & Log 7 exist */
;
To your question of whether this can be further optimized, I would answer: not really. This structure violates so many principles of data quality that it would take less time to explain what it does do right.
Twice in my IT career I have chosen to change jobs rather than put up with the mandate that "change is not an option". If I were handed a system structured like this and told I could not make changes, I would do the best I could while looking for another job.
Sorry to be blunt, but the real problem here is a horrible data model.
I don't know if adding an index would count as a change, but a simple improvement you could make would be a non-clustered index on the name column of the Places table. That would at least improve the join performance from the Logs table.
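A minimal sketch of that index (the index name here is just an example):
CREATE NONCLUSTERED INDEX IX_Places_name  -- hypothetical name; supports the join on Places.name
    ON dbo.Places ([name]);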
Good luck
- Thanks for the query and your honest advice! "...a horrible data model." I can't agree more... We have bad designs all over the place with no time given to re-design. Adding an index should be possible, but I believe the bottleneck is the outer apply (or where) subquery. The main query runs within a second for more than 100k records, which is good enough for my purpose. I will try out your query and let you know. – Thomas Hsieh, Feb 23, 2018 at 12:45
- Forgot to add, [place_index] is the ID of [Places] and [Addresses], and [Logs] doesn't have an ID... – Thomas Hsieh, Feb 23, 2018 at 12:54
- I've tried your solution and even added an index on log_data, but the result is the same. I actually have 6 versions, all of which perform almost, if not exactly, the same, with or without the index... I think you pointed in the right direction, though. If I take out the requirement of checking log_type 7, then LEFT JOIN with an index on log_data outperforms the other versions. Thanks for the effort regardless! – Thomas Hsieh, Feb 26, 2018 at 2:55