I need to move data from one database to another, and since I don't have SSIS I'm doing this ETL with T-SQL scripts.
One of the source tables contains invoice details, and features a column that contains the number of units invoiced per size (it's clothing); as I transfer the data over to my database, I'm normalizing this information. The query works fine, the execution plan looks exactly as I expected, doesn't recommend adding any indexes, ...but it takes about 10 minutes to process.
The source table Staging.dbo.[SourceTable]
contains about 780K rows; the query inserts 1.5M rows into the destination table [DestinationDatabase].dbo.InvoiceDetailSizes
.
The query is scheduled to run daily, as part of an overnight process - the 10 minutes don't really matter, but still I want to be sure everything is as efficient as it can be.
with cteSizedInvoices (
InvoiceNumber, InvoiceLine, SizeRangeCode, UnitsPerSize
)
as (select
src.f2,
cast(src.f5 as int),
src.f23,
src.f19
from Staging.dbo.[SourceTable] src
where src.f1 = '01'
)
--insert into [DestinationDatabase].dbo.InvoiceDetailSizes (InvoiceDetailId, SizeId, Units, DateInserted)
select
detail.Id InvoiceDetailId,
sz.Id SizeId,
buckets.Units,
getdate()
from
cteSizedInvoices src
inner join [DestinationDatabase].dbo.InvoiceHeaders header on src.InvoiceNumber = header.Number
inner join [DestinationDatabase].dbo.InvoiceDetails detail on src.InvoiceLine = detail.LineNumber
and detail.InvoiceHeaderId = header.Id
inner join [DestinationDatabase].dbo.SizeRanges ranges on src.SizeRangeCode = ranges.Code
cross apply (select id SizeIndex, cast(item as int) Units from [DestinationDatabase].dbo.BucketString(src.UnitsPerSize, 5, 1, 1)) buckets
inner join [DestinationDatabase].dbo.Sizes sz on buckets.SizeIndex = sz.SizeRangeIndex
and sz.SizeRangeId = ranges.Id
left join [DestinationDatabase].dbo.InvoiceDetailSizes dst
on detail.Id = dst.InvoiceDetailId
and sz.Id = dst.SizeId
where
buckets.Units <> 0
and dst.id is null;
Here is the BucketString
table-valued function that I'm using - it's adapted from this Stack Overflow answer, modified so that I could specify an "offset", because my data doesn't start at position 1:
create function [dbo].[BucketString] (
@values varchar(max),
@bucketSize int,
@bufferSize int = 1,
@offset int = 0)
returns @result table (id int, item varchar(max))
begin
with buckets as
(
select 1 id
union all
select t.id + 1
from buckets t
where id = t.id
and t.id < len(@values)/(@bucketSize+@bufferSize)+1
)
insert into @result
select
id,
substring(@values, @offset + ((id - 1) * (@bucketSize + @bufferSize) + (case when @bufferSize-1 = 0 then @bufferSize else @bufferSize-(@bufferSize-1) end)), @bucketSize) string
from buckets
option (maxrecursion 0)
return;
end
The query will not insert anything if the data already exists in the destination table, but in order to find out if it exists, I need to run the whole thing, so whether I'm inserting 1.5M rows, or 0, it's still ~10 minutes.
Can it be optimized in any way?
3 Answers 3
Have you tried replacing the CTE? A temp table or subquery may be faster. While CTE's are good for readability they are typically outpaced by alternative options.
Here is a temp table solution.
SELECT
src.f2 AS InvoiceNumber,
CAST(src.f5 AS INT) AS InvoiceLine,
src.f23 AS SizeRangeCode,
src.f19 AS UnitsPerSize
INTO #sizedInvoices
FROM Staging.dbo.[SourceTable] src
WHERE src.f1 = '01'
--INSERT INTO [DestinationDatabase].dbo.InvoiceDetailSizes (InvoiceDetailId, SizeId, Units, DateInserted)
SELECT
detail.Id InvoiceDetailId,
sz.Id SizeId,
buckets.Units,
GETDATE()
FROM #sizedInvoices src
INNER JOIN [DestinationDatabase].dbo.InvoiceHeaders header
ON src.InvoiceNumber = header.Number
INNER JOIN [DestinationDatabase].dbo.InvoiceDetails detail
ON src.InvoiceLine = detail.LineNumber
AND detail.InvoiceHeaderId = header.Id
INNER JOIN [DestinationDatabase].dbo.SizeRanges ranges
ON src.SizeRangeCode = ranges.Code
CROSS APPLY
(
SELECT
id SizeIndex,
CAST(item AS INT) Units
FROM [DestinationDatabase].dbo.BucketString(src.UnitsPerSize, 5, 1, 1)
) buckets
INNER JOIN [DestinationDatabase].dbo.Sizes sz
ON buckets.SizeIndex = sz.SizeRangeIndex
AND sz.SizeRangeId = ranges.Id
LEFT JOIN [DestinationDatabase].dbo.InvoiceDetailSizes dst
ON detail.Id = dst.InvoiceDetailId
AND sz.Id = dst.SizeId
WHERE buckets.Units <> 0
AND dst.id is null;
DROP TABLE #sizedInvoices
Here is a subquery solution.
--insert into [DestinationDatabase].dbo.InvoiceDetailSizes (InvoiceDetailId, SizeId, Units, DateInserted)
SELECT
detail.Id InvoiceDetailId,
sz.Id SizeId,
buckets.Units,
GETDATE()
FROM
(
SELECT
src.f2 AS InvoiceNumber,
CAST(src.f5 AS INT) AS InvoiceLine,
src.f23 AS SizeRangeCode,
src.f19 AS UnitsPerSize
FROM Staging.dbo.[SourceTable] src
WHERE src.f1 = '01'
) src
INNER JOIN [DestinationDatabase].dbo.InvoiceHeaders header
ON src.InvoiceNumber = header.Number
INNER JOIN [DestinationDatabase].dbo.InvoiceDetails detail
ON src.InvoiceLine = detail.LineNumber
AND detail.InvoiceHeaderId = header.Id
INNER JOIN [DestinationDatabase].dbo.SizeRanges ranges
ON src.SizeRangeCode = ranges.Code
CROSS APPLY
(
SELECT
id SizeIndex,
CAST(item AS INT) Units
FROM [DestinationDatabase].dbo.BucketString(src.UnitsPerSize, 5, 1, 1)
) buckets
INNER JOIN [DestinationDatabase].dbo.Sizes sz
ON buckets.SizeIndex = sz.SizeRangeIndex
AND sz.SizeRangeId = ranges.Id
LEFT JOIN [DestinationDatabase].dbo.InvoiceDetailSizes dst
ON detail.Id = dst.InvoiceDetailId
AND sz.Id = dst.SizeId
WHERE buckets.Units <> 0
AND dst.id is null;
In this situation, I would prefer the temp table solution. The answers to this question explain better than I ever could, but basically temp tables will work better with a large number of records. Other wise, it looks like generally well written SQL. I took the liberty of capitalizing keywords in these queries, as is conventional.
-
\$\begingroup\$ Haven't tried it yet, but given the linked answer, I'm giving you the checkmark! Thanks! \$\endgroup\$Mathieu Guindon– Mathieu Guindon2014年09月30日 11:06:50 +00:00Commented Sep 30, 2014 at 11:06
You use the value of (@bucketSize + @bufferSize)
inside of a where clause and to calculate a value in a select statement, I think if you took that and created it's own variable, that it might give you some performance increase. as well as some of the other arithmetic that could be done from the start and not calculated during the actual query
len(@values)/(@bucketSize+@bufferSize)+1
could be declared from the beginning
DECLARE maxID
SET maxID = LEN(@values) / (@bucketSize + @bufferSize) + 1
After you do that it is a toss up whether or not to Declare another Variable for the Addition of the @bucketSize + @bufferSize
, but it might be worth a try.
Also please use some spaces in your equations, you use plenty of white space everywhere else.
I have a minute or two so I will also say this...
since this is going over so many records and that simple Variable freed up a good 10 seconds, I think that it might be worth it to move as much static calculation out of the query as possible, I know this is less about what SQL was meant for, but I also think this is more about being Dynamic as well.
Maybe I am going a little too far here, but this should still do the same thing as the original
create function [dbo].[BucketString] (
@values varchar(max),
@bucketSize int,
@bufferSize int = 1,
@offset int = 0)
returns @result table (id int, item varchar(max))
begin
DECLARE @bucketAndBufferSize
DECLARE @maxID
DECLARE @bufferMinusBufferMinusOne
DECLARE @bufferMinusOne
SET @bucketAndBufferSize = @bucketSize+@bufferSize
SET @maxID = LEN(@values) / (@bucketAndBufferSize) + 1
SET @bufferMinusBufferMinusOne = @bufferSize - (@bufferSize - 1)
SET @bufferMinusOne = @bufferSize - 1
with buckets as
(
select 1 id
union all
select t.id + 1
from buckets t
where id = t.id
and t.id < @maxID
)
insert into @result
select
id,
substring(@values, @offset + ((id - 1) * @bucketAndBufferSize + (case when @bufferMinusOne = 0 then 1 else @bufferMinusBufferMinusOne end)), @bucketSize) string
from buckets
option (maxrecursion 0)
return;
end
I apologize for the mismatched casing, it's habit to capitalized those words...
When bufferSize
is unchanged you want it to be 1 so let's just do that and take out the extra call to the variable.
I took out all the arithmetic that wasn't reliant on information from the query and made them their own variable, this isn't going to be a huge difference, but I think it will be faster than doing the arithmetic inside the query itself, it's like being distracted by that red ball all the time, it distracts you momentarily but you get the job done.
-
3\$\begingroup\$ That shaved off a whole... 5-10 seconds! :) \$\endgroup\$Mathieu Guindon– Mathieu Guindon2014年09月29日 18:48:06 +00:00Commented Sep 29, 2014 at 18:48
I don't see much that could improve performance other than what @Malachi said. I did notice that (at least to my preference, completely subjective) your code would read easier with more vertical white space, especially in your join
conditions and subqueries.
Here is how I personally would format it:
with cteSizedInvoices (
InvoiceNumber, InvoiceLine, SizeRangeCode, UnitsPerSize
)
as (select
src.f2,
cast(src.f5 as int),
src.f23,
src.f19
from Staging.dbo.[SourceTable] src
where src.f1 = '01'
)
--insert into [DestinationDatabase].dbo.InvoiceDetailSizes (InvoiceDetailId, SizeId, Units, DateInserted)
select
detail.Id InvoiceDetailId,
sz.Id SizeId,
buckets.Units,
getdate()
from
cteSizedInvoices src
inner join [DestinationDatabase].dbo.InvoiceHeaders header
on src.InvoiceNumber = header.Number
inner join [DestinationDatabase].dbo.InvoiceDetails detail
on src.InvoiceLine = detail.LineNumber
and detail.InvoiceHeaderId = header.Id
inner join [DestinationDatabase].dbo.SizeRanges ranges
on src.SizeRangeCode = ranges.Code
cross apply (
select id SizeIndex,
cast(item as int) Units
from [DestinationDatabase].dbo.BucketString(src.UnitsPerSize, 5, 1, 1)
) buckets
inner join [DestinationDatabase].dbo.Sizes sz
on buckets.SizeIndex = sz.SizeRangeIndex
and sz.SizeRangeId = ranges.Id
left join [DestinationDatabase].dbo.InvoiceDetailSizes dst
on detail.Id = dst.InvoiceDetailId
and sz.Id = dst.SizeId
where
buckets.Units <> 0
and dst.id is null;