I am trying to update a table with an array of values. Each item in the array contains information that matches a row in a table in the SQL Server database. If the row already exists in the table, we update that row with the information from the array; otherwise, we insert a new row. In other words, I have described an upsert.
Now, I am trying to achieve this in a stored procedure that takes an XML parameter. The reason I am using XML and not a table-valued parameter (TVP) is that with the latter I would have to create a custom table type in SQL and associate this type with the stored procedure. If I ever changed something in my stored procedure or my db schema down the road, I would have to redo both the stored procedure and the custom type, which I want to avoid. Besides, the superiority that a TVP has over XML is not useful in my situation because my data array size will never exceed 1000. This means I cannot use the solution proposed here: How to insert multiple records using XML in SQL server 2008
Also, a similar discussion here (UPSERT - Is there a better alternative to MERGE or @@rowcount?) is different from what I am asking because I am trying to upsert multiple rows at once.
I was hoping to simply use the following set of queries to upsert the values from the XML, but this is not going to work: this approach only works when the input is a single row.
BEGIN TRAN;

UPDATE [Table] WITH (SERIALIZABLE)
SET    Field1 = @Field1, ...
WHERE  [Key] = @Key;

IF @@ROWCOUNT = 0
BEGIN
    INSERT INTO [Table] ([Key], ...) VALUES (@Key, ...);
END;

COMMIT TRAN;
The next alternative is an exhaustive IF EXISTS check, or one of its variations, of the following form. But I reject this on the grounds of sub-optimal efficiency:
IF EXISTS (SELECT * FROM [Table] WHERE [Key] = @Key)
    UPDATE ...
ELSE
    INSERT ...
The next option was using the MERGE statement, as described here: http://www.databasejournal.com/features/mssql/using-the-merge-statement-to-perform-an-upsert.html. But then I read about issues with the MERGE statement here: http://www.mssqltips.com/sqlservertip/3074/use-caution-with-sql-servers-merge-statement/. For this reason, I am trying to avoid MERGE.
So, now my question is: is there any other option or a better way to achieve a multi-row upsert from an XML parameter in a SQL Server 2008 stored procedure?
Please note that the data in the XML parameter may contain some records that should not be upserted because they are older than the current record. There is a ModifiedDate field in both the XML and the destination table that needs to be compared in order to determine whether the record should be updated or discarded.
1 Answer
Whether the source is XML or a TVP does not make a huge difference. The overall operation is essentially:
- UPDATE existing rows
- INSERT missing rows
You do it in that order because if you INSERT first, then all rows will exist when the UPDATE runs, and you will do repeated work on the rows that were just inserted.
Beyond that there are different ways to accomplish this and various ways to tweak some additional efficiency out of it.
Let's start with the bare minimum. Since extracting the XML is likely to be one of the more expensive parts of this operation (if not the most expensive), we don't want to have to do that twice (as we have two operations to perform). So, we create a temp table and extract the data out of the XML into it:
CREATE TABLE #TempImport
(
    Field1 DataType1,
    Field2 DataType2,
    ...
);

INSERT INTO #TempImport (Field1, Field2, ...)
    SELECT  tab.col.value('XQueryForField1', 'DataType') AS [Field1],
            tab.col.value('XQueryForField2', 'DataType') AS [Field2],
            ...
    FROM    @XmlInputParam.nodes('XQuery') tab(col);
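To make the shredding concrete, here is a minimal sketch assuming a hypothetical XML shape; the Customer element and its CustomerID / Name / ModifiedDate attributes are illustrative names, not taken from the question:

DECLARE @XmlInputParam XML = N'
<Customers>
  <Customer CustomerID="1" Name="Alice" ModifiedDate="2014-11-01T10:00:00" />
  <Customer CustomerID="2" Name="Bob"   ModifiedDate="2014-11-02T09:30:00" />
</Customers>';

CREATE TABLE #TempImport
(
    CustomerID   INT           NOT NULL,
    Name         NVARCHAR(100) NOT NULL,
    ModifiedDate DATETIME      NOT NULL
);

-- shred one row per <Customer> element, pulling each attribute out via value()
INSERT INTO #TempImport (CustomerID, Name, ModifiedDate)
    SELECT  tab.col.value('@CustomerID', 'INT'),
            tab.col.value('@Name', 'NVARCHAR(100)'),
            tab.col.value('@ModifiedDate', 'DATETIME')
    FROM    @XmlInputParam.nodes('/Customers/Customer') tab(col);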
From there we do the UPDATE and then the INSERT:
UPDATE tab
SET     tab.Field1 = tmp.Field1,
        tab.Field2 = tmp.Field2,
        ...
FROM    [SchemaName].[TableName] tab
INNER JOIN #TempImport tmp
        ON tmp.IDField = tab.IDField
        ... -- more fields if PK or alternate key is composite
;

INSERT INTO [SchemaName].[TableName]
        (Field1, Field2, ...)
    SELECT  tmp.Field1, tmp.Field2, ...
    FROM    #TempImport tmp
    WHERE   NOT EXISTS (
                SELECT  *
                FROM    [SchemaName].[TableName] tab
                WHERE   tab.IDField = tmp.IDField
                        ... -- more fields if PK or alternate key is composite
            );
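Continuing the hypothetical Customers example from above (dbo.Customers is an assumed target table name), the pair of statements would look like this:

-- update rows that already exist in the target
UPDATE tab
SET     tab.Name = tmp.Name,
        tab.ModifiedDate = tmp.ModifiedDate
FROM    dbo.Customers tab
INNER JOIN #TempImport tmp
        ON tmp.CustomerID = tab.CustomerID;

-- insert rows that do not yet exist in the target
INSERT INTO dbo.Customers (CustomerID, Name, ModifiedDate)
    SELECT  tmp.CustomerID, tmp.Name, tmp.ModifiedDate
    FROM    #TempImport tmp
    WHERE   NOT EXISTS (
                SELECT  *
                FROM    dbo.Customers tab
                WHERE   tab.CustomerID = tmp.CustomerID
            );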
Now that we have the basic operation down, we can do a few things to optimize:
- capture the @@ROWCOUNT of the INSERT into the temp table and compare it to the @@ROWCOUNT of the UPDATE; if they are the same, then we can skip the INSERT
- capture the ID values that were updated via the OUTPUT clause and DELETE those from the temp table; then the INSERT doesn't need the WHERE NOT EXISTS(...)
- if there are any rows in the incoming data that should not be synced (i.e. neither inserted nor updated), then those records should be removed prior to doing the UPDATE

All three of these optimizations are folded into the full template below:
CREATE TABLE #TempImport
(
    Field1 DataType1,
    Field2 DataType2,
    ...
);

DECLARE @ImportRows INT;
DECLARE @UpdatedIDs TABLE ([IDField] INT NOT NULL);

BEGIN TRY
    INSERT INTO #TempImport (Field1, Field2, ...)
        SELECT  tab.col.value('XQueryForField1', 'DataType') AS [Field1],
                tab.col.value('XQueryForField2', 'DataType') AS [Field2],
                ...
        FROM    @XmlInputParam.nodes('XQuery') tab(col);

    SET @ImportRows = @@ROWCOUNT;

    IF (@ImportRows = 0)
    BEGIN
        RAISERROR('Seriously?', 16, 1); -- no rows to import
    END;

    -- optional: test to see if it helps or hurts
    -- ALTER TABLE #TempImport
    --     ADD CONSTRAINT [PK_#TempImport]
    --     PRIMARY KEY CLUSTERED (PKField ASC)
    --     WITH FILLFACTOR = 100;

    -- optional: remove any records that should not be synced
    DELETE tmp
    FROM   #TempImport tmp
    INNER JOIN [SchemaName].[TableName] tab
            ON tab.IDField = tmp.IDField
            ... -- more fields if PK or alternate key is composite
    WHERE  tmp.ModifiedDate < tab.ModifiedDate;

    BEGIN TRAN;

    UPDATE tab
    SET     tab.Field1 = tmp.Field1,
            tab.Field2 = tmp.Field2,
            ...
    OUTPUT  INSERTED.IDField
    INTO    @UpdatedIDs ([IDField]) -- capture IDs that are updated
    FROM    [SchemaName].[TableName] tab
    INNER JOIN #TempImport tmp
            ON tmp.IDField = tab.IDField
            ... -- more fields if PK or alternate key is composite
    ;

    IF (@@ROWCOUNT < @ImportRows) -- if all rows were updates then skip, else insert remaining
    BEGIN
        -- get rid of rows that were updates, leaving only the ones to insert
        DELETE tmp
        FROM   #TempImport tmp
        INNER JOIN @UpdatedIDs del
                ON del.[IDField] = tmp.[IDField];

        -- OR, rather than the DELETE, maybe add a column to #TempImport for:
        --     [IsUpdate] BIT NOT NULL DEFAULT (0)
        -- Then UPDATE #TempImport SET [IsUpdate] = 1 JOIN @UpdatedIDs ON [IDField]
        -- Then, in the below INSERT, add: WHERE [IsUpdate] = 0

        INSERT INTO [SchemaName].[TableName]
                (Field1, Field2, ...)
            SELECT  tmp.Field1, tmp.Field2, ...
            FROM    #TempImport tmp;
    END;

    COMMIT TRAN;
END TRY
BEGIN CATCH
    IF (@@TRANCOUNT > 0)
    BEGIN
        ROLLBACK;
    END;

    -- THROW; -- if using SQL 2012 or newer, use this and remove the following 3 lines
    DECLARE @ErrorMessage NVARCHAR(4000) = ERROR_MESSAGE();
    RAISERROR(@ErrorMessage, 16, 1);
    RETURN;
END CATCH;
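For completeness, here is a hedged sketch of how this template might be wrapped as the stored procedure the question describes; the procedure name and the dbo.Customers columns are assumptions carried over from the earlier example:

CREATE PROCEDURE dbo.Customers_UpsertFromXml
(
    @XmlInputParam XML
)
AS
SET NOCOUNT ON;

-- body: the full TRY / CATCH template above, with Field1 / Field2 replaced
-- by Name / ModifiedDate, IDField by CustomerID, and
-- [SchemaName].[TableName] by dbo.Customers
GO

-- usage:
EXEC dbo.Customers_UpsertFromXml @XmlInputParam = N'
<Customers>
  <Customer CustomerID="1" Name="Alice" ModifiedDate="2014-11-01T10:00:00" />
</Customers>';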
I have used this model several times on imports / ETLs that either have well over 1000 rows, or batches of maybe 500 out of a total set ranging from 20k to over a million rows. However, I have not tested the performance difference between the DELETE of the updated rows from the temp table and merely updating the [IsUpdate] field.
A note regarding the decision (mentioned in the question) to use XML over a TVP because there are, at most, 1000 rows to import at a time:
If this is being called a few times here and there, then quite possibly the minor performance gain of the TVP is not worth the additional maintenance cost (needing to drop the proc before changing the User-Defined Table Type, app code changes, etc.). But if you are importing 4 million rows, sending 1000 at a time, that is 4000 executions (and 4 million rows of XML to parse no matter how they are broken up), and even a minor per-execution performance difference will add up to a noticeable difference across that many executions.
That being said, the method as I have described it does not change beyond replacing the SELECT FROM @XmlInputParam with SELECT FROM @TVP. Since TVPs are read-only, you wouldn't be able to delete from them. I guess you could simply add a WHERE NOT EXISTS(SELECT * FROM @UpdatedIDs ids WHERE ids.IDField = tmp.IDField) to that final SELECT (tied to the INSERT) instead of the simple WHERE IsUpdate = 0. And if you were to use the @UpdatedIDs table variable in this manner, then you could even get away with not dumping the incoming rows into the temp table at all.
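As a sketch of that TVP variant, reusing the hypothetical Customers example (the table type name and the @TVP parameter are assumptions; @UpdatedIDs is the same table variable from the template above, here declared with a CustomerID column, and the UPDATE ... OUTPUT step is unchanged):

-- assumed user-defined table type and read-only parameter:
-- CREATE TYPE dbo.CustomerImportType AS TABLE
-- (
--     CustomerID   INT           NOT NULL,
--     Name         NVARCHAR(100) NOT NULL,
--     ModifiedDate DATETIME      NOT NULL
-- );
-- ... @TVP dbo.CustomerImportType READONLY ...

-- read straight from the TVP; filter out the already-updated IDs
-- instead of deleting them (TVPs are read-only)
INSERT INTO dbo.Customers (CustomerID, Name, ModifiedDate)
    SELECT  tmp.CustomerID, tmp.Name, tmp.ModifiedDate
    FROM    @TVP tmp
    WHERE   NOT EXISTS (SELECT *
                        FROM   @UpdatedIDs ids
                        WHERE  ids.CustomerID = tmp.CustomerID);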
Comment: The issues with MERGE that Bertrand points out are mostly edge cases and inefficiencies, not show stoppers - MS wouldn't have released it if it was a real minefield. Are you sure that the convolutions you're going through to avoid MERGE are not creating more potential error than they're saving?

Reply: This really isn't much more convoluted than MERGE; the INSERT and UPDATE steps of MERGE are still processed separately. The main difference in my approach is the table variable that holds the updated record IDs and the DELETE query that uses that table variable to remove those records from the temp table of the incoming data. And I suppose the SOURCE could be direct from @XMLparam.nodes() instead of dumping to a temp table, but still, that's not a lot of extra stuff to not have to worry about ever finding yourself in one of those edge cases ;-).