XML splitting query very slow

Question 1

I have a query to split a delimited string into to multiple columns. The query is extremely slow and takes more than 5 minutes to handle 80444 rows. How can I improve execution time?

 WITH cte (Execution_ID,Intake_Generic_ID, [File_Name],F1,Report_Attributes) AS ( SELECT 
 Execution_ID, Intake_Generic_ID,
 [File_Name], F1,
 CONVERT(XML,'<Report><Attribute>' 
 + REPLACE(F1,'~', '</Attribute><Attribute>') 
 + '</Attribute></Report>') AS Report_Attributes FROM [dbo].[Intake_Generic] )
SELECT 
 Execution_ID, Intake_Generic_ID,
 [File_Name],
 LTRIM(RTRIM(Report.value('Attribute[1]','varchar(50)'))) AS [TransactionDate],
 LTRIM(RTRIM(Report.value('Attribute[2]','varchar(50)'))) AS [TransactionNum],
 LTRIM(RTRIM(Report.value('Attribute[3]','varchar(50)'))) AS [Interchange],
 LTRIM(RTRIM(Report.value('Attribute[4]','varchar(50)'))) AS [Gantry], 
 LTRIM(RTRIM(Report.value('Attribute[6]','varchar(50)'))) AS [GantryDirection], 
 LTRIM(RTRIM(Report.value('Attribute[7]','varchar(50)'))) AS [GantryEntryExit], 
 LTRIM(RTRIM(Report.value('Attribute[8]','varchar(50)'))) AS [RTCID], 
 LTRIM(RTRIM(Report.value('Attribute[9]','varchar(100)'))) AS [Vehicle_Classification_Cd], 
 LTRIM(RTRIM(Report.value('Attribute[10]','varchar(100)'))) AS [VehicleClassification], 
 LTRIM(RTRIM(Report.value('Attribute[11]','varchar(100)'))) AS [Txn_Iden_Cd], 
 LTRIM(RTRIM(Report.value('Attribute[12]','varchar(100)'))) AS [TransactionIdentification] FROM cte S 
 CROSS APPLY Report_Attributes.nodes('/Report') AS Tbl(Report) order by Intake_Generic_ID

Question 2

Why is Intake_Generic.F1 stored in that horrible format to begin with? And why do you need to use XML at all?

Question 3

@200_success Intake_Generic.F1 is a ~ delimited string which I get from the source file.

Question 4

But why is it loaded into the database as one long string rather than as a table?

Question 5

@200_success - I have no control on how it is been loaded. I have visibility to only the intake table with the long string.

Question 6

@need_the_buzz please do not change the code in your question after answers have been posted. See What should I do when someone answers my question?

Question 7

This won't do much for the performance, but maintaining this kind of T-SQL is painful, so proper formatting is essential.

The FROM clause is hidden, and one has to pay very close attention to notice the ORDER BY clause as well. Consider:

WITH cte (Execution_ID,Intake_Generic_ID, [File_Name],F1,Report_Attributes) AS ( 
 SELECT 
 Execution_ID
 ,Intake_Generic_ID
 ,[File_Name]
 ,F1
 ,CONVERT(XML,
 '<Report><Attribute>' + 
 REPLACE(F1,'~', '</Attribute <Attribute>') + 
 '</Attribute></Report>') AS Report_Attributes 
 FROM [dbo].[Intake_Generic]
)
SELECT 
 Execution_ID
 ,Intake_Generic_ID
 ,[File_Name]
 ,LTRIM(RTRIM(Report.value('Attribute[1]','varchar(50)'))) AS [TransactionDate]
 ,LTRIM(RTRIM(Report.value('Attribute[2]','varchar(50)'))) AS [TransactionNum]
 ,LTRIM(RTRIM(Report.value('Attribute[3]','varchar(50)'))) AS [Interchange]
 ,LTRIM(RTRIM(Report.value('Attribute[4]','varchar(50)'))) AS [Gantry]
 ,LTRIM(RTRIM(Report.value('Attribute[6]','varchar(50)'))) AS [GantryDirection]
 ,LTRIM(RTRIM(Report.value('Attribute[7]','varchar(50)'))) AS [GantryEntryExit]
 ,LTRIM(RTRIM(Report.value('Attribute[8]','varchar(50)'))) AS [RTCID]
 ,LTRIM(RTRIM(Report.value('Attribute[9]','varchar(100)'))) AS [Vehicle_Classification_Cd]
 ,LTRIM(RTRIM(Report.value('Attribute[10]','varchar(100)'))) AS [VehicleClassification]
 ,LTRIM(RTRIM(Report.value('Attribute[11]','varchar(100)'))) AS [Txn_Iden_Cd]
 ,LTRIM(RTRIM(Report.value('Attribute[12]','varchar(100)'))) AS [TransactionIdentification] 
FROM cte S 
 CROSS APPLY Report_Attributes.nodes('/Report') AS Tbl(Report)
ORDER BY Intake_Generic_ID

XQuery is slower than "normal" T-SQL querying, pretty much by definition.

If the results don't need to be sorted, or if they can be sorted by whoever is consuming this data, then removing the ORDER BY clause should give you a little boost, but I suspect the CROSS APPLY is much more expensive than the sorting.

Looking more closely at what the query is doing, I'd say the XQuery isn't required for this to work - it was clever, but as you saw, a total performance killer.

 ,F1
 ,CONVERT(XML,
 '<Report><Attribute>' + 
 REPLACE(F1,'~', '</Attribute <Attribute>') + 
 '</Attribute></Report>') AS Report_Attributes

You're building the XML from the contents of F1, by replacing occurrences of ~ within the varchar value.

If the data is shaped anything like a list of ~~(削除) comma (削除ここまで)~~ tilde-separated values, you could write a table-valued function that returns a table with 12 records given that F1 string:

declare @result as table (
 AttributeIndex int
 ,AttributeValue varchar(50)
);

Your CTE could be CROSS APPLY-ing that function with the contents of Intake_Generic, and the actual SELECT wouldn't have much work left to do. And without XML involved, I'm willing to bet that you'd get the results much, much faster.

Question 8

Thanks for your answer. I took out the order by and the CTE from the query. The query is done in like 14 seconds now. I agree to your comment to use the table valued function for better performance, but we are prohibited from creating functions in production.

Question 9

@need_the_buzz interesting.. could you be missing an index?

Question 10

@Mat'sMug, splitting a character delimited string via this "xml trick", is the fastest approach AFAIC. The TV-UDF you are speaking of, would have to do the splitting in the one or other way too. CHARINDEX and its friends won't perform better...

Question 11

@Mat'sMug About your question above "could you be missing and index?" The CTE is handled very similar to a declared table variable. The optimizer won't be able to use the indexes for sorting outside of the CTE. This is one (maybe the biggest) weakness of CTEs. They should - if ever possible - be used to handle rather small row counts, to prepare variables and do pre-calculations.

Question 12

@Mat'sMug, "need_the_buzz" posted some code which brought me to some more comparisons. It might be of interest for you too. If you let all 4 solutions work in one go, the percentage of the batches is 1, 1, 8 and 90... at least with the very small amount of test-data...

Question 13

According to comments below this is a new, shorter code without sorting, to make it comparable:

First I create test data

CREATE TABLE Intake_Generic(Execution_ID INT,Intake_Generic_ID INT,[File_Name] VARCHAR(100),F1 VARCHAR(MAX));
INSERt INTO Intake_Generic VALUES
 (1,2,'SomeFileName','This~is~a~test')
,(2,1,'SomeFileName','And~just~another~test');
GO

Now "my" code with a CTE

WITH cte 
AS
(
 SELECT 
 Execution_ID,
 Intake_Generic_ID,
 [File_Name],
 CAST('<x>' + REPLACE(LTRIM(RTRIM(F1)),'~', '</x><x>') + '</x>' AS XML) AS Report_Attributes
 FROM [dbo].[Intake_Generic]
)
SELECT 
 Execution_ID,
 Intake_Generic_ID,
 [File_Name],
 LTRIM(RTRIM(Report_Attributes.value('x[1]','varchar(50)'))) AS [Attrib1],
 LTRIM(RTRIM(Report_Attributes.value('x[2]','varchar(50)'))) AS [Attrib2],
 LTRIM(RTRIM(Report_Attributes.value('x[3]','varchar(50)'))) AS [Attrib3],
 LTRIM(RTRIM(Report_Attributes.value('x[4]','varchar(50)'))) AS [Attrib4]
FROM cte
GO

The CTE will be inlined and therefore produce exactly the same execution plan as "your" approach with the XML-cast in the FROM clause (not exactly as you copied it into the comment, because this wouldn't produce the same result, see below):

SELECT Execution_ID,
 Intake_Generic_ID,
 [File_Name],
 LTRIM(RTRIM(NewXML.value('x[1]','varchar(50)'))) AS [Attrib1],
 LTRIM(RTRIM(NewXML.value('x[2]','varchar(50)'))) AS [Attrib2],
 LTRIM(RTRIM(NewXML.value('x[3]','varchar(50)'))) AS [Attrib3],
 LTRIM(RTRIM(NewXML.value('x[4]','varchar(50)'))) AS [Attrib4]
FROM ( SELECT NewXML=CAST('<x>'+REPLACE(F1,'~','</x><x>')+'</x>' AS XML) 
 ,Execution_ID
 ,Intake_Generic_ID
 ,[File_Name] 
 FROM Intake_Generic ) AS CastXML 
GO

This is the code as you posted it into your comment. It is - according to the execution plan - about 8 times slower and comes back with only the first value:

SELECT Execution_ID
 , Intake_Generic_ID
 , [File_Name]
 , LTRIM(RTRIM( x.i.value('i[1]','varchar(50)'))) AS [TransactionDate] 
FROM ( SELECT NewXML=CAST('<Report><i>'+REPLACE(F1,'~','</i><i>')+'</i></Report>' AS XML) 
 ,Execution_ID
 ,Intake_Generic_ID
 ,[File_Name] 
 FROM Intake_Generic ) AS CastXML 
CROSS APPLY NewXML.nodes('/Report') x(i);
GO

And finally these is the code as it comes from the other answer (reduced to produce the same result for comparability):

It takes more than 90% of the batch...

WITH cte (Execution_ID,Intake_Generic_ID, [File_Name],Report_Attributes) AS ( 
 SELECT 
 Execution_ID
 ,Intake_Generic_ID
 ,[File_Name]
 ,CONVERT(XML,
 '<Report><Attribute>' + 
 REPLACE(F1,'~', '</Attribute><Attribute>') + 
 '</Attribute></Report>') AS Report_Attributes 
 FROM [dbo].[Intake_Generic]
)
SELECT 
 Execution_ID
 ,Intake_Generic_ID
 ,[File_Name]
 ,LTRIM(RTRIM(Report.value('Attribute[1]','varchar(50)'))) AS [TransactionDate]
 ,LTRIM(RTRIM(Report.value('Attribute[2]','varchar(50)'))) AS [TransactionNum]
 ,LTRIM(RTRIM(Report.value('Attribute[3]','varchar(50)'))) AS [Interchange]
 ,LTRIM(RTRIM(Report.value('Attribute[4]','varchar(50)'))) AS [Gantry]
FROM cte S 
 CROSS APPLY Report_Attributes.nodes('/Report') AS Tbl(Report)
GO

So: Please check these approaches. I'm afraid your code might be quick but only because it doesn't produce what you expect... (or I got wrong what you need)

Question 14

please explain your answer more so the OP and future readers can understand it.

Question 15

@chillworld added some explanation. Hope this is clear now..

Question 16

I voted the answer, but can you do it next time outside the coding part? Btw, welcome here.

Question 17

@Shnugo - I have tried your query. It took 43 seconds to run against my dataset but the query I have right now as per Mat's comments took me only 14 seconds.

Question 18

@need_the_buzz The only reason for this can be the sorting. Please try to take away the "SortInx" everywhere and test it again. Is there an index in the column you want to sort after?

Question 19

There is a known issue with the XML method of string splitting where it can perform horribly because of repeated re-evaluations of the REPLACE and the CAST to XML. More details in this blog post.

The best performing method is CLR but it sounds as though this will not be an option for you. As the maximum split index of interest is known in advance a more robust method, avoiding XML, might be

SELECT Execution_ID,
 Intake_Generic_ID,
 File_Name,
 F1, 
 LTRIM(RTRIM(SUBSTRING(F, C0, C1- C0 - 1))) AS TransactionDate, 
 LTRIM(RTRIM(SUBSTRING(F, C1, C2- C1 - 1))) AS TransactionNum, 
 LTRIM(RTRIM(SUBSTRING(F, C2, C3- C2 - 1))) AS Interchange, 
 LTRIM(RTRIM(SUBSTRING(F, C3, C4- C3 - 1))) AS Gantry, 
-- LTRIM(RTRIM(SUBSTRING(F, C4, C5- C4 - 1))) AS Unknown, 
 LTRIM(RTRIM(SUBSTRING(F, C5, C6- C5 - 1))) AS GantryDirection, 
 LTRIM(RTRIM(SUBSTRING(F, C6, C7- C6 - 1))) AS GantryEntryExit, 
 LTRIM(RTRIM(SUBSTRING(F, C7, C8- C7 - 1))) AS RTCID, 
 LTRIM(RTRIM(SUBSTRING(F, C8, C9- C8 - 1))) AS Vehicle_Classification_Cd, 
 LTRIM(RTRIM(SUBSTRING(F, C9, C10- C9 - 1))) AS VehicleClassification, 
 LTRIM(RTRIM(SUBSTRING(F,C10, C11-C10 - 1))) AS Txn_Iden_Cd, 
 LTRIM(RTRIM(SUBSTRING(F,C11, C12-C11 - 1))) AS TransactionIdentification
FROM [dbo].[Intake_Generic] 
CROSS APPLY (VALUES(1, F1 + REPLICATE('~',12))) V0(C0,F)
CROSS APPLY (VALUES(1 + CHARINDEX('~',F, C0 ))) V1 (C1)
CROSS APPLY (VALUES(1 + CHARINDEX('~',F, C1 ))) V2 (C2)
CROSS APPLY (VALUES(1 + CHARINDEX('~',F, C2 ))) V3 (C3)
CROSS APPLY (VALUES(1 + CHARINDEX('~',F, C3 ))) V4 (C4)
CROSS APPLY (VALUES(1 + CHARINDEX('~',F, C4 ))) V5 (C5)
CROSS APPLY (VALUES(1 + CHARINDEX('~',F, C5 ))) V6 (C6)
CROSS APPLY (VALUES(1 + CHARINDEX('~',F, C6 ))) V7 (C7)
CROSS APPLY (VALUES(1 + CHARINDEX('~',F, C7 ))) V8 (C8)
CROSS APPLY (VALUES(1 + CHARINDEX('~',F, C8 ))) V9 (C9)
CROSS APPLY (VALUES(1 + CHARINDEX('~',F, C9) )) V10(C10)
CROSS APPLY (VALUES(1 + CHARINDEX('~',F, C10))) V11(C11)
CROSS APPLY (VALUES(1 + CHARINDEX('~',F, C11))) V12(C12)

Or another way would be to use Jeff Moden's approach from this article with the necessary adjustments to work without creating a function and to PIVOT the results into a single row.

WITH E1(N)
 AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
 SELECT 1), --10E+1 or 10 rows
 E2(N)
 AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
 E4(N)
 AS (SELECT 1 FROM E2 a, E2 b) --10E+4 or 10,000 rows max
 SELECT Execution_ID,
 Intake_Generic_ID,
 File_Name,
 F1, 
 [1] AS TransactionDate, 
 [2] AS TransactionNum, 
 [3] AS Interchange, 
 [4] AS Gantry, 
 -- [5] AS Unknown, 
 [6] AS GantryDirection, 
 [7] AS GantryEntryExit, 
 [8] AS RTCID, 
 [9] AS Vehicle_Classification_Cd, 
 [10] AS VehicleClassification, 
 [11] AS Txn_Iden_Cd, 
 [12] AS TransactionIdentification
FROM dbo.Intake_Generic ig
 CROSS APPLY (SELECT *
 FROM (SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
 Item = LTRIM(RTRIM(SUBSTRING(ig.F1, l.N1, l.L1)))
 FROM (SELECT N1 = s.N1,
 L1 = ISNULL(NULLIF(CHARINDEX('~', ig.F1, s.N1), 0) - s.N1, 8000)
 FROM (SELECT N1 = 1
 UNION ALL
 SELECT N1 = Nums.N + 1
 FROM (SELECT TOP (ISNULL(DATALENGTH(ig.F1), 0)) 
 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
 FROM E4) Nums(N)
 WHERE SUBSTRING(ig.F1, Nums.N, 1) = '~') s) l) 
 f
 PIVOT (MAX(Item) 
 FOR ItemNumber IN ([1], [2], [3], 
 [4], [5], [6], 
 [7], [8], [9], 
 [10], [11], [12])) P
 ) CA;

score 4 · Accepted Answer · 2015-12-15 19:15:31Z

This won't do much for the performance, but maintaining this kind of T-SQL is painful, so proper formatting is essential.

The FROM clause is hidden, and one has to pay very close attention to notice the ORDER BY clause as well. Consider:

WITH cte (Execution_ID,Intake_Generic_ID, [File_Name],F1,Report_Attributes) AS ( 
 SELECT 
 Execution_ID
 ,Intake_Generic_ID
 ,[File_Name]
 ,F1
 ,CONVERT(XML,
 '<Report><Attribute>' + 
 REPLACE(F1,'~', '</Attribute <Attribute>') + 
 '</Attribute></Report>') AS Report_Attributes 
 FROM [dbo].[Intake_Generic]
)
SELECT 
 Execution_ID
 ,Intake_Generic_ID
 ,[File_Name]
 ,LTRIM(RTRIM(Report.value('Attribute[1]','varchar(50)'))) AS [TransactionDate]
 ,LTRIM(RTRIM(Report.value('Attribute[2]','varchar(50)'))) AS [TransactionNum]
 ,LTRIM(RTRIM(Report.value('Attribute[3]','varchar(50)'))) AS [Interchange]
 ,LTRIM(RTRIM(Report.value('Attribute[4]','varchar(50)'))) AS [Gantry]
 ,LTRIM(RTRIM(Report.value('Attribute[6]','varchar(50)'))) AS [GantryDirection]
 ,LTRIM(RTRIM(Report.value('Attribute[7]','varchar(50)'))) AS [GantryEntryExit]
 ,LTRIM(RTRIM(Report.value('Attribute[8]','varchar(50)'))) AS [RTCID]
 ,LTRIM(RTRIM(Report.value('Attribute[9]','varchar(100)'))) AS [Vehicle_Classification_Cd]
 ,LTRIM(RTRIM(Report.value('Attribute[10]','varchar(100)'))) AS [VehicleClassification]
 ,LTRIM(RTRIM(Report.value('Attribute[11]','varchar(100)'))) AS [Txn_Iden_Cd]
 ,LTRIM(RTRIM(Report.value('Attribute[12]','varchar(100)'))) AS [TransactionIdentification] 
FROM cte S 
 CROSS APPLY Report_Attributes.nodes('/Report') AS Tbl(Report)
ORDER BY Intake_Generic_ID

XQuery is slower than "normal" T-SQL querying, pretty much by definition.

If the results don't need to be sorted, or if they can be sorted by whoever is consuming this data, then removing the ORDER BY clause should give you a little boost, but I suspect the CROSS APPLY is much more expensive than the sorting.

Looking more closely at what the query is doing, I'd say the XQuery isn't required for this to work - it was clever, but as you saw, a total performance killer.

 ,F1
 ,CONVERT(XML,
 '<Report><Attribute>' + 
 REPLACE(F1,'~', '</Attribute <Attribute>') + 
 '</Attribute></Report>') AS Report_Attributes

You're building the XML from the contents of F1, by replacing occurrences of ~ within the varchar value.

If the data is shaped anything like a list of ~~(削除) comma (削除ここまで)~~ tilde-separated values, you could write a table-valued function that returns a table with 12 records given that F1 string:

declare @result as table (
 AttributeIndex int
 ,AttributeValue varchar(50)
);

Your CTE could be CROSS APPLY-ing that function with the contents of Intake_Generic, and the actual SELECT wouldn't have much work left to do. And without XML involved, I'm willing to bet that you'd get the results much, much faster.

Thanks for your answer. I took out the order by and the CTE from the query. The query is done in like 14 seconds now. I agree to your comment to use the table valued function for better performance, but we are prohibited from creating functions in production.
@Mat'sMug, splitting a character delimited string via this "xml trick", is the fastest approach AFAIC. The TV-UDF you are speaking of, would have to do the splitting in the one or other way too. CHARINDEX and its friends won't perform better...
@Mat'sMug About your question above "could you be missing and index?" The CTE is handled very similar to a declared table variable. The optimizer won't be able to use the indexes for sorting outside of the CTE. This is one (maybe the biggest) weakness of CTEs. They should - if ever possible - be used to handle rather small row counts, to prepare variables and do pre-calculations.
@Mat'sMug, "need_the_buzz" posted some code which brought me to some more comparisons. It might be of interest for you too. If you let all 4 solutions work in one go, the percentage of the batches is 1, 1, 8 and 90... at least with the very small amount of test-data...

Stack Exchange Network

XML splitting query very slow

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

XML splitting query very slow

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions