Normalizing Invoices

Question 1

I need to move data from one database to another, and since I don't have SSIS I'm doing this ETL with T-SQL scripts.

One of the source tables contains invoice details, including a column that holds the number of units invoiced per size (it's clothing); as I transfer the data to my database, I'm normalizing this information. The query works, and the execution plan looks exactly as I expected and doesn't recommend adding any indexes, but it takes about 10 minutes to run.

The source table Staging.dbo.[SourceTable] contains about 780K rows; the query inserts 1.5M rows into the destination table [DestinationDatabase].dbo.InvoiceDetailSizes.

The query is scheduled to run daily, as part of an overnight process - the 10 minutes don't really matter, but still I want to be sure everything is as efficient as it can be.

with cteSizedInvoices (
 InvoiceNumber, InvoiceLine, SizeRangeCode, UnitsPerSize
)
as (select
 src.f2,
 cast(src.f5 as int),
 src.f23,
 src.f19
 from Staging.dbo.[SourceTable] src
 where src.f1 = '01'
)
--insert into [DestinationDatabase].dbo.InvoiceDetailSizes (InvoiceDetailId, SizeId, Units, DateInserted)
select
 detail.Id InvoiceDetailId,
 sz.Id SizeId,
 buckets.Units,
 getdate()
from
 cteSizedInvoices src
 inner join [DestinationDatabase].dbo.InvoiceHeaders header on src.InvoiceNumber = header.Number
 inner join [DestinationDatabase].dbo.InvoiceDetails detail on src.InvoiceLine = detail.LineNumber
 and detail.InvoiceHeaderId = header.Id
 inner join [DestinationDatabase].dbo.SizeRanges ranges on src.SizeRangeCode = ranges.Code
 cross apply (select id SizeIndex, cast(item as int) Units from [DestinationDatabase].dbo.BucketString(src.UnitsPerSize, 5, 1, 1)) buckets
 inner join [DestinationDatabase].dbo.Sizes sz on buckets.SizeIndex = sz.SizeRangeIndex
 and sz.SizeRangeId = ranges.Id
 left join [DestinationDatabase].dbo.InvoiceDetailSizes dst
 on detail.Id = dst.InvoiceDetailId
 and sz.Id = dst.SizeId
where 
 buckets.Units <> 0
 and dst.id is null;

Here is the BucketString table-valued function that I'm using - it's adapted from this Stack Overflow answer, modified so that I could specify an "offset", because my data doesn't start at position 1:

create function [dbo].[BucketString] (
 @values varchar(max),
 @bucketSize int,
 @bufferSize int = 1,
 @offset int = 0)
returns @result table (id int, item varchar(max))
begin
 with buckets as
 ( 
 select 1 id
 union all
 select t.id + 1
 from buckets t
 where id = t.id 
 and t.id < len(@values)/(@bucketSize+@bufferSize)+1
 )
 insert into @result
 select 
 id, 
 substring(@values, @offset + ((id - 1) * (@bucketSize + @bufferSize) + (case when @bufferSize-1 = 0 then @bufferSize else @bufferSize-(@bufferSize-1) end)), @bucketSize) string
 from buckets 
 option (maxrecursion 0)
 return;
end
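
For illustration, here is a minimal sketch of what the function returns for the arguments I pass (5, 1, 1). The sample string is made up, not real data; it assumes one leading character followed by 5-digit counts separated by a single character.

```sql
-- Hypothetical sample value: 1 leading character, then three 5-digit buckets
-- separated by one space each (18 characters in total).
declare @sample varchar(max) = 'X00002 00000 00010';

-- With @bucketSize = 5, @bufferSize = 1, @offset = 1, bucket n starts at
-- @offset + (n - 1) * (@bucketSize + @bufferSize) + 1, i.e. positions 2, 8, 14, 20.
select
    id SizeIndex,
    item,
    cast(item as int) Units
from [DestinationDatabase].dbo.BucketString(@sample, 5, 1, 1);
```

For this sample the recursion stops at id 4 (18 / 6 + 1), and the fourth bucket starts past the end of the string, so its item is empty and casts to 0. Rows like that, and genuinely empty sizes such as the second bucket here, are dropped by the `buckets.Units <> 0` filter in the main query.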

The query will not insert anything if the data already exists in the destination table, but in order to find out if it exists, I need to run the whole thing, so whether I'm inserting 1.5M rows, or 0, it's still ~10 minutes.

Can it be optimized in any way?

execution plan

Question 2

Have you tried replacing the CTE? A temp table or subquery may be faster. While CTEs are good for readability, they are often outperformed by these alternatives.

Here is a temp table solution.

SELECT
src.f2 AS InvoiceNumber,
CAST(src.f5 AS INT) AS InvoiceLine,
src.f23 AS SizeRangeCode,
src.f19 AS UnitsPerSize
INTO #sizedInvoices
FROM Staging.dbo.[SourceTable] src
WHERE src.f1 = '01';
--INSERT INTO [DestinationDatabase].dbo.InvoiceDetailSizes (InvoiceDetailId, SizeId, Units, DateInserted)
SELECT
 detail.Id InvoiceDetailId,
 sz.Id SizeId,
 buckets.Units,
 GETDATE()
FROM #sizedInvoices src
INNER JOIN [DestinationDatabase].dbo.InvoiceHeaders header 
 ON src.InvoiceNumber = header.Number
INNER JOIN [DestinationDatabase].dbo.InvoiceDetails detail 
 ON src.InvoiceLine = detail.LineNumber
 AND detail.InvoiceHeaderId = header.Id
INNER JOIN [DestinationDatabase].dbo.SizeRanges ranges 
 ON src.SizeRangeCode = ranges.Code
CROSS APPLY 
(
 SELECT 
 id SizeIndex, 
 CAST(item AS INT) Units 
 FROM [DestinationDatabase].dbo.BucketString(src.UnitsPerSize, 5, 1, 1)
) buckets
INNER JOIN [DestinationDatabase].dbo.Sizes sz 
 ON buckets.SizeIndex = sz.SizeRangeIndex
 AND sz.SizeRangeId = ranges.Id
LEFT JOIN [DestinationDatabase].dbo.InvoiceDetailSizes dst
 ON detail.Id = dst.InvoiceDetailId
 AND sz.Id = dst.SizeId
WHERE buckets.Units <> 0
AND dst.id is null;
DROP TABLE #sizedInvoices;

Here is a subquery solution.

--insert into [DestinationDatabase].dbo.InvoiceDetailSizes (InvoiceDetailId, SizeId, Units, DateInserted)
SELECT
 detail.Id InvoiceDetailId,
 sz.Id SizeId,
 buckets.Units,
 GETDATE()
FROM 
(
 SELECT
 src.f2 AS InvoiceNumber,
 CAST(src.f5 AS INT) AS InvoiceLine,
 src.f23 AS SizeRangeCode,
 src.f19 AS UnitsPerSize
 FROM Staging.dbo.[SourceTable] src
 WHERE src.f1 = '01'
) src
INNER JOIN [DestinationDatabase].dbo.InvoiceHeaders header 
 ON src.InvoiceNumber = header.Number
INNER JOIN [DestinationDatabase].dbo.InvoiceDetails detail 
 ON src.InvoiceLine = detail.LineNumber
 AND detail.InvoiceHeaderId = header.Id
INNER JOIN [DestinationDatabase].dbo.SizeRanges ranges 
 ON src.SizeRangeCode = ranges.Code
CROSS APPLY 
(
 SELECT 
 id SizeIndex, 
 CAST(item AS INT) Units 
 FROM [DestinationDatabase].dbo.BucketString(src.UnitsPerSize, 5, 1, 1)
) buckets
INNER JOIN [DestinationDatabase].dbo.Sizes sz 
 ON buckets.SizeIndex = sz.SizeRangeIndex
 AND sz.SizeRangeId = ranges.Id
LEFT JOIN [DestinationDatabase].dbo.InvoiceDetailSizes dst
 ON detail.Id = dst.InvoiceDetailId
 AND sz.Id = dst.SizeId
WHERE buckets.Units <> 0
AND dst.id is null;

In this situation, I would prefer the temp table solution. The answers to this question explain it better than I ever could, but basically temp tables work better with a large number of records. Otherwise, it looks like generally well-written SQL. I took the liberty of capitalizing keywords in these queries, as is conventional.
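
One reason a temp table can help, as far as I know, is that it is materialized once, gets its own statistics, and can be indexed, whereas SQL Server expands a CTE into the outer query. A minimal sketch of that idea follows; the index name IX_sizedInvoices and the choice of key columns are my own assumptions, not something taken from your schema.

```sql
-- Materialize the filtered staging rows once.
SELECT
    src.f2 AS InvoiceNumber,
    CAST(src.f5 AS INT) AS InvoiceLine,
    src.f23 AS SizeRangeCode,
    src.f19 AS UnitsPerSize
INTO #sizedInvoices
FROM Staging.dbo.[SourceTable] src
WHERE src.f1 = '01';

-- Hypothetical index (name and key columns are assumptions): it covers the
-- two columns used to join to InvoiceHeaders and InvoiceDetails.
-- Assumes f2 is a short code; a clustered index key is limited to 900 bytes
-- and cannot include a varchar(max) column.
CREATE CLUSTERED INDEX IX_sizedInvoices
    ON #sizedInvoices (InvoiceNumber, InvoiceLine);
```

Whether the index pays for itself depends on the data, so it is worth comparing the execution plans with and without it.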

Question 3

Haven't tried it yet, but given the linked answer, I'm giving you the checkmark! Thanks!

Question 4

You use the value of (@bucketSize + @bufferSize) both inside a WHERE clause and to calculate a value in a SELECT statement. If you stored it in its own variable, it might give you some performance increase. The same goes for some of the other arithmetic, which could be done once up front instead of being calculated during the actual query.

len(@values)/(@bucketSize+@bufferSize)+1

could be declared from the beginning

DECLARE @maxID int
SET @maxID = LEN(@values) / (@bucketSize + @bufferSize) + 1

After you do that, it is a toss-up whether to declare another variable for @bucketSize + @bufferSize, but it might be worth a try.

Also, please use some spaces in your equations; you use plenty of white space everywhere else.

I have a minute or two so I will also say this...

Since this is going over so many records and that simple variable freed up a good 10 seconds, I think it might be worth moving as much static calculation out of the query as possible. I know this is less about what SQL was meant for, but I also think this is more about being dynamic as well.

Maybe I am going a little too far here, but this should still do the same thing as the original

create function [dbo].[BucketString] (
 @values varchar(max),
 @bucketSize int,
 @bufferSize int = 1,
 @offset int = 0)
returns @result table (id int, item varchar(max))
begin
 DECLARE @bucketAndBufferSize int
 DECLARE @maxID int
 DECLARE @bufferMinusBufferMinusOne int
 DECLARE @bufferMinusOne int
 SET @bucketAndBufferSize = @bucketSize + @bufferSize
 SET @maxID = LEN(@values) / @bucketAndBufferSize + 1
 SET @bufferMinusBufferMinusOne = @bufferSize - (@bufferSize - 1)
 -- the statement before a WITH (CTE) must end with a semicolon
 SET @bufferMinusOne = @bufferSize - 1;
 with buckets as
 ( 
 select 1 id
 union all
 select t.id + 1
 from buckets t
 where id = t.id 
 and t.id < @maxID
 )
 insert into @result
 select 
 id, 
 substring(@values, @offset + ((id - 1) * @bucketAndBufferSize + (case when @bufferMinusOne = 0 then 1 else @bufferMinusBufferMinusOne end)), @bucketSize) string
 from buckets 
 option (maxrecursion 0)
 return;
end

I apologize for the mismatched casing; it's habit to capitalize those words...

When bufferSize is unchanged you want it to be 1, so let's just do that and take out the extra call to the variable.

I took out all the arithmetic that didn't rely on information from the query and put each piece in its own variable. This isn't going to make a huge difference, but I think it will be faster than doing the arithmetic inside the query itself. It's like being distracted by that red ball all the time: it distracts you momentarily, but you get the job done.
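
A side note on the CASE expression in the SUBSTRING call: as far as I can tell it evaluates to 1 for any @bufferSize, because @bufferSize - (@bufferSize - 1) is always 1 and the other branch only fires when @bufferSize is already 1. Here is a quick check you could run, using a handful of made-up buffer sizes:

```sql
-- The sample buffer sizes are arbitrary; every row should show caseResult = 1.
SELECT
    b.bufferSize,
    CASE
        WHEN b.bufferSize - 1 = 0 THEN b.bufferSize
        ELSE b.bufferSize - (b.bufferSize - 1)
    END AS caseResult
FROM (VALUES (1), (2), (3), (10)) AS b(bufferSize);
```

If that holds, the start position simplifies to @offset + (id - 1) * @bucketAndBufferSize + 1, and the two bufferMinus variables are not needed at all.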

Question 5

That shaved off a whole... 5-10 seconds! :)

Question 6

I don't see much that could improve performance other than what @Malachi said. I did notice that (at least to my preference, completely subjective) your code would be easier to read with more vertical white space, especially in your join conditions and subqueries.

Here is how I personally would format it:

with cteSizedInvoices (
    InvoiceNumber, InvoiceLine, SizeRangeCode, UnitsPerSize
)
as (select
        src.f2,
        cast(src.f5 as int),
        src.f23,
        src.f19
    from Staging.dbo.[SourceTable] src
    where src.f1 = '01'
)
--insert into [DestinationDatabase].dbo.InvoiceDetailSizes (InvoiceDetailId, SizeId, Units, DateInserted)
select
    detail.Id InvoiceDetailId,
    sz.Id SizeId,
    buckets.Units,
    getdate()
from
    cteSizedInvoices src
    inner join [DestinationDatabase].dbo.InvoiceHeaders header
        on src.InvoiceNumber = header.Number
    inner join [DestinationDatabase].dbo.InvoiceDetails detail
        on src.InvoiceLine = detail.LineNumber
        and detail.InvoiceHeaderId = header.Id
    inner join [DestinationDatabase].dbo.SizeRanges ranges
        on src.SizeRangeCode = ranges.Code
    cross apply (
        select id SizeIndex,
               cast(item as int) Units
        from [DestinationDatabase].dbo.BucketString(src.UnitsPerSize, 5, 1, 1)
    ) buckets
    inner join [DestinationDatabase].dbo.Sizes sz
        on buckets.SizeIndex = sz.SizeRangeIndex
        and sz.SizeRangeId = ranges.Id
    left join [DestinationDatabase].dbo.InvoiceDetailSizes dst
        on detail.Id = dst.InvoiceDetailId
        and sz.Id = dst.SizeId
where
    buckets.Units <> 0
    and dst.id is null;

PenutReaper · Accepted Answer · 2014-09-30 11:00:04Z

