I have a query to split a delimited string into to multiple columns. The query is extremely slow and takes more than 5 minutes to handle 80444 rows. How can I improve execution time?
WITH cte (Execution_ID,Intake_Generic_ID, [File_Name],F1,Report_Attributes) AS ( SELECT
Execution_ID, Intake_Generic_ID,
[File_Name], F1,
CONVERT(XML,'<Report><Attribute>'
+ REPLACE(F1,'~', '</Attribute><Attribute>')
+ '</Attribute></Report>') AS Report_Attributes FROM [dbo].[Intake_Generic] )
SELECT
Execution_ID, Intake_Generic_ID,
[File_Name],
LTRIM(RTRIM(Report.value('Attribute[1]','varchar(50)'))) AS [TransactionDate],
LTRIM(RTRIM(Report.value('Attribute[2]','varchar(50)'))) AS [TransactionNum],
LTRIM(RTRIM(Report.value('Attribute[3]','varchar(50)'))) AS [Interchange],
LTRIM(RTRIM(Report.value('Attribute[4]','varchar(50)'))) AS [Gantry],
LTRIM(RTRIM(Report.value('Attribute[6]','varchar(50)'))) AS [GantryDirection],
LTRIM(RTRIM(Report.value('Attribute[7]','varchar(50)'))) AS [GantryEntryExit],
LTRIM(RTRIM(Report.value('Attribute[8]','varchar(50)'))) AS [RTCID],
LTRIM(RTRIM(Report.value('Attribute[9]','varchar(100)'))) AS [Vehicle_Classification_Cd],
LTRIM(RTRIM(Report.value('Attribute[10]','varchar(100)'))) AS [VehicleClassification],
LTRIM(RTRIM(Report.value('Attribute[11]','varchar(100)'))) AS [Txn_Iden_Cd],
LTRIM(RTRIM(Report.value('Attribute[12]','varchar(100)'))) AS [TransactionIdentification] FROM cte S
CROSS APPLY Report_Attributes.nodes('/Report') AS Tbl(Report) order by Intake_Generic_ID
3 Answers 3
This won't do much for the performance, but maintaining this kind of T-SQL is painful, so proper formatting is essential.
The FROM
clause is hidden, and one has to pay very close attention to notice the ORDER BY
clause as well. Consider:
WITH cte (Execution_ID,Intake_Generic_ID, [File_Name],F1,Report_Attributes) AS (
SELECT
Execution_ID
,Intake_Generic_ID
,[File_Name]
,F1
,CONVERT(XML,
'<Report><Attribute>' +
REPLACE(F1,'~', '</Attribute <Attribute>') +
'</Attribute></Report>') AS Report_Attributes
FROM [dbo].[Intake_Generic]
)
SELECT
Execution_ID
,Intake_Generic_ID
,[File_Name]
,LTRIM(RTRIM(Report.value('Attribute[1]','varchar(50)'))) AS [TransactionDate]
,LTRIM(RTRIM(Report.value('Attribute[2]','varchar(50)'))) AS [TransactionNum]
,LTRIM(RTRIM(Report.value('Attribute[3]','varchar(50)'))) AS [Interchange]
,LTRIM(RTRIM(Report.value('Attribute[4]','varchar(50)'))) AS [Gantry]
,LTRIM(RTRIM(Report.value('Attribute[6]','varchar(50)'))) AS [GantryDirection]
,LTRIM(RTRIM(Report.value('Attribute[7]','varchar(50)'))) AS [GantryEntryExit]
,LTRIM(RTRIM(Report.value('Attribute[8]','varchar(50)'))) AS [RTCID]
,LTRIM(RTRIM(Report.value('Attribute[9]','varchar(100)'))) AS [Vehicle_Classification_Cd]
,LTRIM(RTRIM(Report.value('Attribute[10]','varchar(100)'))) AS [VehicleClassification]
,LTRIM(RTRIM(Report.value('Attribute[11]','varchar(100)'))) AS [Txn_Iden_Cd]
,LTRIM(RTRIM(Report.value('Attribute[12]','varchar(100)'))) AS [TransactionIdentification]
FROM cte S
CROSS APPLY Report_Attributes.nodes('/Report') AS Tbl(Report)
ORDER BY Intake_Generic_ID
XQuery is slower than "normal" T-SQL querying, pretty much by definition.
If the results don't need to be sorted, or if they can be sorted by whoever is consuming this data, then removing the ORDER BY
clause should give you a little boost, but I suspect the CROSS APPLY
is much more expensive than the sorting.
Looking more closely at what the query is doing, I'd say the XQuery isn't required for this to work - it was clever, but as you saw, a total performance killer.
,F1 ,CONVERT(XML, '<Report><Attribute>' + REPLACE(F1,'~', '</Attribute <Attribute>') + '</Attribute></Report>') AS Report_Attributes
You're building the XML from the contents of F1
, by replacing occurrences of ~
within the varchar
value.
If the data is shaped anything like a list of (削除) comma (削除ここまで) tilde-separated values, you could write a table-valued function that returns a table with 12 records given that F1
string:
declare @result as table (
AttributeIndex int
,AttributeValue varchar(50)
);
Your CTE could be CROSS APPLY
-ing that function with the contents of Intake_Generic
, and the actual SELECT
wouldn't have much work left to do. And without XML involved, I'm willing to bet that you'd get the results much, much faster.
-
1\$\begingroup\$ Thanks for your answer. I took out the order by and the CTE from the query. The query is done in like 14 seconds now. I agree to your comment to use the table valued function for better performance, but we are prohibited from creating functions in production. \$\endgroup\$need_the_buzz– need_the_buzz2015年12月15日 19:52:12 +00:00Commented Dec 15, 2015 at 19:52
-
\$\begingroup\$ @need_the_buzz interesting.. could you be missing an index? \$\endgroup\$Mathieu Guindon– Mathieu Guindon2015年12月16日 00:52:05 +00:00Commented Dec 16, 2015 at 0:52
-
\$\begingroup\$ @Mat'sMug, splitting a character delimited string via this "xml trick", is the fastest approach AFAIC. The TV-UDF you are speaking of, would have to do the splitting in the one or other way too. CHARINDEX and its friends won't perform better... \$\endgroup\$Gottfried Lesigang– Gottfried Lesigang2015年12月16日 08:39:09 +00:00Commented Dec 16, 2015 at 8:39
-
\$\begingroup\$ @Mat'sMug About your question above "could you be missing and index?" The CTE is handled very similar to a declared table variable. The optimizer won't be able to use the indexes for sorting outside of the CTE. This is one (maybe the biggest) weakness of CTEs. They should - if ever possible - be used to handle rather small row counts, to prepare variables and do pre-calculations. \$\endgroup\$Gottfried Lesigang– Gottfried Lesigang2015年12月16日 08:46:07 +00:00Commented Dec 16, 2015 at 8:46
-
\$\begingroup\$ @Mat'sMug, "need_the_buzz" posted some code which brought me to some more comparisons. It might be of interest for you too. If you let all 4 solutions work in one go, the percentage of the batches is 1, 1, 8 and 90... at least with the very small amount of test-data... \$\endgroup\$Gottfried Lesigang– Gottfried Lesigang2015年12月17日 20:07:34 +00:00Commented Dec 17, 2015 at 20:07
According to comments below this is a new, shorter code without sorting, to make it comparable:
First I create test data
CREATE TABLE Intake_Generic(Execution_ID INT,Intake_Generic_ID INT,[File_Name] VARCHAR(100),F1 VARCHAR(MAX));
INSERt INTO Intake_Generic VALUES
(1,2,'SomeFileName','This~is~a~test')
,(2,1,'SomeFileName','And~just~another~test');
GO
Now "my" code with a CTE
WITH cte
AS
(
SELECT
Execution_ID,
Intake_Generic_ID,
[File_Name],
CAST('<x>' + REPLACE(LTRIM(RTRIM(F1)),'~', '</x><x>') + '</x>' AS XML) AS Report_Attributes
FROM [dbo].[Intake_Generic]
)
SELECT
Execution_ID,
Intake_Generic_ID,
[File_Name],
LTRIM(RTRIM(Report_Attributes.value('x[1]','varchar(50)'))) AS [Attrib1],
LTRIM(RTRIM(Report_Attributes.value('x[2]','varchar(50)'))) AS [Attrib2],
LTRIM(RTRIM(Report_Attributes.value('x[3]','varchar(50)'))) AS [Attrib3],
LTRIM(RTRIM(Report_Attributes.value('x[4]','varchar(50)'))) AS [Attrib4]
FROM cte
GO
The CTE will be inlined and therefore produce exactly the same execution plan as "your" approach with the XML-cast in the FROM clause (not exactly as you copied it into the comment, because this wouldn't produce the same result, see below):
SELECT Execution_ID,
Intake_Generic_ID,
[File_Name],
LTRIM(RTRIM(NewXML.value('x[1]','varchar(50)'))) AS [Attrib1],
LTRIM(RTRIM(NewXML.value('x[2]','varchar(50)'))) AS [Attrib2],
LTRIM(RTRIM(NewXML.value('x[3]','varchar(50)'))) AS [Attrib3],
LTRIM(RTRIM(NewXML.value('x[4]','varchar(50)'))) AS [Attrib4]
FROM ( SELECT NewXML=CAST('<x>'+REPLACE(F1,'~','</x><x>')+'</x>' AS XML)
,Execution_ID
,Intake_Generic_ID
,[File_Name]
FROM Intake_Generic ) AS CastXML
GO
This is the code as you posted it into your comment. It is - according to the execution plan - about 8 times slower and comes back with only the first value:
SELECT Execution_ID
, Intake_Generic_ID
, [File_Name]
, LTRIM(RTRIM( x.i.value('i[1]','varchar(50)'))) AS [TransactionDate]
FROM ( SELECT NewXML=CAST('<Report><i>'+REPLACE(F1,'~','</i><i>')+'</i></Report>' AS XML)
,Execution_ID
,Intake_Generic_ID
,[File_Name]
FROM Intake_Generic ) AS CastXML
CROSS APPLY NewXML.nodes('/Report') x(i);
GO
And finally these is the code as it comes from the other answer (reduced to produce the same result for comparability):
It takes more than 90% of the batch...
WITH cte (Execution_ID,Intake_Generic_ID, [File_Name],Report_Attributes) AS (
SELECT
Execution_ID
,Intake_Generic_ID
,[File_Name]
,CONVERT(XML,
'<Report><Attribute>' +
REPLACE(F1,'~', '</Attribute><Attribute>') +
'</Attribute></Report>') AS Report_Attributes
FROM [dbo].[Intake_Generic]
)
SELECT
Execution_ID
,Intake_Generic_ID
,[File_Name]
,LTRIM(RTRIM(Report.value('Attribute[1]','varchar(50)'))) AS [TransactionDate]
,LTRIM(RTRIM(Report.value('Attribute[2]','varchar(50)'))) AS [TransactionNum]
,LTRIM(RTRIM(Report.value('Attribute[3]','varchar(50)'))) AS [Interchange]
,LTRIM(RTRIM(Report.value('Attribute[4]','varchar(50)'))) AS [Gantry]
FROM cte S
CROSS APPLY Report_Attributes.nodes('/Report') AS Tbl(Report)
GO
So: Please check these approaches. I'm afraid your code might be quick but only because it doesn't produce what you expect... (or I got wrong what you need)
-
\$\begingroup\$ please explain your answer more so the OP and future readers can understand it. \$\endgroup\$chillworld– chillworld2015年12月16日 09:37:20 +00:00Commented Dec 16, 2015 at 9:37
-
\$\begingroup\$ @chillworld added some explanation. Hope this is clear now.. \$\endgroup\$Gottfried Lesigang– Gottfried Lesigang2015年12月16日 10:19:13 +00:00Commented Dec 16, 2015 at 10:19
-
\$\begingroup\$ I voted the answer, but can you do it next time outside the coding part? Btw, welcome here. \$\endgroup\$chillworld– chillworld2015年12月16日 13:28:19 +00:00Commented Dec 16, 2015 at 13:28
-
\$\begingroup\$ @Shnugo - I have tried your query. It took 43 seconds to run against my dataset but the query I have right now as per Mat's comments took me only 14 seconds. \$\endgroup\$need_the_buzz– need_the_buzz2015年12月17日 14:51:37 +00:00Commented Dec 17, 2015 at 14:51
-
1\$\begingroup\$ @need_the_buzz The only reason for this can be the sorting. Please try to take away the "SortInx" everywhere and test it again. Is there an index in the column you want to sort after? \$\endgroup\$Gottfried Lesigang– Gottfried Lesigang2015年12月17日 15:11:55 +00:00Commented Dec 17, 2015 at 15:11
There is a known issue with the XML method of string splitting where it can perform horribly because of repeated re-evaluations of the REPLACE
and the CAST
to XML
. More details in this blog post.
The best performing method is CLR but it sounds as though this will not be an option for you. As the maximum split index of interest is known in advance a more robust method, avoiding XML, might be
SELECT Execution_ID,
Intake_Generic_ID,
File_Name,
F1,
LTRIM(RTRIM(SUBSTRING(F, C0, C1- C0 - 1))) AS TransactionDate,
LTRIM(RTRIM(SUBSTRING(F, C1, C2- C1 - 1))) AS TransactionNum,
LTRIM(RTRIM(SUBSTRING(F, C2, C3- C2 - 1))) AS Interchange,
LTRIM(RTRIM(SUBSTRING(F, C3, C4- C3 - 1))) AS Gantry,
-- LTRIM(RTRIM(SUBSTRING(F, C4, C5- C4 - 1))) AS Unknown,
LTRIM(RTRIM(SUBSTRING(F, C5, C6- C5 - 1))) AS GantryDirection,
LTRIM(RTRIM(SUBSTRING(F, C6, C7- C6 - 1))) AS GantryEntryExit,
LTRIM(RTRIM(SUBSTRING(F, C7, C8- C7 - 1))) AS RTCID,
LTRIM(RTRIM(SUBSTRING(F, C8, C9- C8 - 1))) AS Vehicle_Classification_Cd,
LTRIM(RTRIM(SUBSTRING(F, C9, C10- C9 - 1))) AS VehicleClassification,
LTRIM(RTRIM(SUBSTRING(F,C10, C11-C10 - 1))) AS Txn_Iden_Cd,
LTRIM(RTRIM(SUBSTRING(F,C11, C12-C11 - 1))) AS TransactionIdentification
FROM [dbo].[Intake_Generic]
CROSS APPLY (VALUES(1, F1 + REPLICATE('~',12))) V0(C0,F)
CROSS APPLY (VALUES(1 + CHARINDEX('~',F, C0 ))) V1 (C1)
CROSS APPLY (VALUES(1 + CHARINDEX('~',F, C1 ))) V2 (C2)
CROSS APPLY (VALUES(1 + CHARINDEX('~',F, C2 ))) V3 (C3)
CROSS APPLY (VALUES(1 + CHARINDEX('~',F, C3 ))) V4 (C4)
CROSS APPLY (VALUES(1 + CHARINDEX('~',F, C4 ))) V5 (C5)
CROSS APPLY (VALUES(1 + CHARINDEX('~',F, C5 ))) V6 (C6)
CROSS APPLY (VALUES(1 + CHARINDEX('~',F, C6 ))) V7 (C7)
CROSS APPLY (VALUES(1 + CHARINDEX('~',F, C7 ))) V8 (C8)
CROSS APPLY (VALUES(1 + CHARINDEX('~',F, C8 ))) V9 (C9)
CROSS APPLY (VALUES(1 + CHARINDEX('~',F, C9) )) V10(C10)
CROSS APPLY (VALUES(1 + CHARINDEX('~',F, C10))) V11(C11)
CROSS APPLY (VALUES(1 + CHARINDEX('~',F, C11))) V12(C12)
Or another way would be to use Jeff Moden's approach from this article with the necessary adjustments to work without creating a function and to PIVOT
the results into a single row.
WITH E1(N)
AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1), --10E+1 or 10 rows
E2(N)
AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N)
AS (SELECT 1 FROM E2 a, E2 b) --10E+4 or 10,000 rows max
SELECT Execution_ID,
Intake_Generic_ID,
File_Name,
F1,
[1] AS TransactionDate,
[2] AS TransactionNum,
[3] AS Interchange,
[4] AS Gantry,
-- [5] AS Unknown,
[6] AS GantryDirection,
[7] AS GantryEntryExit,
[8] AS RTCID,
[9] AS Vehicle_Classification_Cd,
[10] AS VehicleClassification,
[11] AS Txn_Iden_Cd,
[12] AS TransactionIdentification
FROM dbo.Intake_Generic ig
CROSS APPLY (SELECT *
FROM (SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = LTRIM(RTRIM(SUBSTRING(ig.F1, l.N1, l.L1)))
FROM (SELECT N1 = s.N1,
L1 = ISNULL(NULLIF(CHARINDEX('~', ig.F1, s.N1), 0) - s.N1, 8000)
FROM (SELECT N1 = 1
UNION ALL
SELECT N1 = Nums.N + 1
FROM (SELECT TOP (ISNULL(DATALENGTH(ig.F1), 0))
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM E4) Nums(N)
WHERE SUBSTRING(ig.F1, Nums.N, 1) = '~') s) l)
f
PIVOT (MAX(Item)
FOR ItemNumber IN ([1], [2], [3],
[4], [5], [6],
[7], [8], [9],
[10], [11], [12])) P
) CA;
Intake_Generic.F1
stored in that horrible format to begin with? And why do you need to use XML at all? \$\endgroup\$