SQL Server : Merge and update multiple rows across columns

Question 1

I am trying to consolidate data in one table and merge it into another.

The data in the ~~(削除) source (削除ここまで)~~ destination table is as follows :

name |dob |city |occupation
-----------------------------------------------------------------------------------
galileo-galilei |1900年01月01日 00:00:00.000 |rome |polymath
issac-newton |1900年01月01日 00:00:00.000 |london |mathematician-scientist
leonardo-da-cinci |1900年01月03日 00:00:00.000 |rome |polymath

The data in the ~~(削除) destination (削除ここまで)~~ source table is :

sl_no |name |dob |city |occupation
-----------------------------------------------------------------------------
1 |galileo-galilei |1900年01月01日 00:00:00.000 | |
2 |galileo-galilei |1900年01月02日 00:00:00.000 |venice |
3 |galileo-galilei |1900年01月05日 00:00:00.000 | |astronomer

The expected result in the destination table is :

name |dob |city |occupation
-----------------------------------------------------------------------------------
galileo-galilei |1900年01月05日 00:00:00.000 |venice |astronomer
issac-newton |1900年01月01日 00:00:00.000 |london |mathematician-scientist
leonardo-da-cinci |1900年01月03日 00:00:00.000 |rome |polymath

My attempts using update-with-join and merge have not been successful.

Update-With-Join :

-- updates data from the first match only
update p
set p.city = s.city,
p.occupation = s.occupation
from person_update_with_join_test_primary p, person_update_with_join_test_secondary s
where p.name = s.name ;

Merge :

-- https://technet.microsoft.com/en-us/library/bb522522(v=sql.105).aspx
/*
The MERGE statement attempted to UPDATE or DELETE the same row more than once. 
This happens when a target row matches more than one source row. 
A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. 
Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.
*/
begin
merge person_update_with_join_test_primary as p
using person_update_with_join_test_secondary as s
on (p.name = s.name)
when not matched by target 
then insert (name, dob, city, occupation) 
values (s.name, s.dob, s.city, s.occupation)
when matched 
then update set p.dob = s.dob 
, p.city=(case when (len(s.city)>0) then s.city else p.city end)
, p.occupation=(case when (len(s.occupation)>0) then s.occupation else p.occupation end)
output $action, inserted.*, deleted.*;
end

I believe what I am looking for is similar to what has been posted here and here. However its not quite what I am looking for.

Are there any other ways to accomplish this other than using a cursor and an upsert (assuming that works)?

Update #1 :

Basically as long as the values in the source are not empty the most recent values (having highest id value) in the source are expected to be merged into the destination.

Eg : For #3 row in the source, the city column would not be considered to be merged into destination. Similarly for #2 , the occupation column would not be considered for being merged into destination. The the column name is a primary-key in the destination table.

I am trying to achieve the same state in the destination table as would have been expected if I were to iterate over the source data and update only the non-empty values in the destination - by using a query instead of doing it through application.

Question 2

I see 4 dob values, in 2 tables, for galilleo. 2 different city values (rome, venice), 2 different occupations (polymath, astronomer. How the specific ones are be chosen?

Question 3

Had reversed the source and destination tables. Have corrected that. Also added an explanation. Thanks!

Question 4

If you only ever have one row were the city and occupation columns are populated you can achieve it with a windowing function:

E.g:

DECLARE @Source TABLE(
 sl_no INT
 ,name NVARCHAR(30)
 ,dob DATETIME2(3)
 ,city NVARCHAR(30)
 ,occupation NVARCHAR(30)
);
INSERT INTO @Source
VALUES
 (1, 'galileo-galilei', '1900-01-01 00:00:00.000', NULL, NULL),
 (2, 'galileo-galilei', '1900-01-02 00:00:00.000', 'venice', NULL),
 (3, 'galileo-galilei', '1900-01-05 00:00:00.000', NULL, 'astronomer'),
 (4, 'issac-newton', '1900-01-01 00:00:00.000', 'london', 'mathematician-scientist')
SELECT DISTINCT
 name
 ,MAX(dob) OVER(PARTITION BY name) AS dob
 ,MAX(city) OVER(PARTITION BY name) AS city
 ,MAX(occupation) OVER(PARTITION BY name) AS occupation
FROM 
 @Source

However, I suspect the reality is you could have multiple records and you always want to return the value from the most recent record that has data in those columns. E.g. if your source was:

DECLARE @Source TABLE(
 sl_no INT
 ,name NVARCHAR(30)
 ,dob DATETIME2(3)
 ,city NVARCHAR(30)
 ,occupation NVARCHAR(30)
);
INSERT INTO @Source
VALUES
 (1, 'galileo-galilei', '1900-01-01 00:00:00.000', 'rome', NULL),
 (2, 'galileo-galilei', '1900-01-02 00:00:00.000', 'venice', NULL),
 (3, 'galileo-galilei', '1900-01-05 00:00:00.000', NULL, 'astronomer'),
 (4, 'issac-newton', '1900-01-01 00:00:00.000', 'london', 'mathematician-scientist')

You could achieve what you want with:

SELECT
 s.name
 ,s.dob
 ,sc.city
 ,so.occupation
FROM
 @Source AS s
 CROSS APPLY(
 SELECT TOP 1 city
 FROM @Source AS s2
 WHERE s2.name = s.name
 AND city IS NOT NULL
 ORDER BY sl_no DESC
 ) AS sc
 CROSS APPLY(
 SELECT TOP 1 occupation
 FROM @Source AS s3
 WHERE s3.name = s.name
 AND occupation IS NOT NULL
 ORDER BY sl_no DESC
 ) AS so
WHERE
 s.sl_no = (SELECT MAX(sl_no) FROM @Source AS s4 WHERE s4.name = s.name)

Wrap that up into a merge or update (I'll do a merge for you) and you will get:

WITH src AS (
 SELECT
 s.name
 ,s.dob
 ,sc.city
 ,so.occupation
 FROM
 @Source AS s
 CROSS APPLY(
 SELECT TOP 1 city
 FROM @Source AS s2
 WHERE s2.name = s.name
 AND city IS NOT NULL
 ORDER BY sl_no DESC
 ) AS sc
 CROSS APPLY(
 SELECT TOP 1 occupation
 FROM @Source AS s3
 WHERE s3.name = s.name
 AND occupation IS NOT NULL
 ORDER BY sl_no DESC
 ) AS so
 WHERE
 s.sl_no = (SELECT MAX(sl_no) FROM @Source AS s4 WHERE s4.name = s.name)
)
MERGE INTO Destination AS tgt
USING tgt.name = src.name
WHEN MATCHED THEN UPDATE
SET dob = src.dob
 ,city = src.city
 ,occupation = src.occupation
WHEN NOT MATCHED THEN INSERT(name, dob, city, occupation)
VALUES(src.name, src.dob, src.city, src.occuptaion);

You are going to want to index the column you join on all the time (name in the above examples) for performance. Otherwise you will get many scans.

Question 5

You are right - there are indeed more rows with different names in the source table. Cross-apply is new to me. Thanks!

Question 6

CROSS APPLY is a bit like a for each. For every row in the left hand side apply it to the right hand side. I use that to go get the most recent populated values for a given set in the code above. I use the TOP 1 and ORDER BY to do that.

Question 7

Did a explain plan on the handful of data that I had for the three approaches above : update-with-join, merge and the merge-with-cross-apply. The breakup was as follows : 36%, 36% and 27%. Thanks again!

Question 8

Correction : The above breakup was using the select-with-cross-apply and not the merge. The merge has a syntax error and a typo . Once that is fixed (USING src on tgt.name = src.name and src.occupation ) - the cost wise break-up for : update-with-join, merge, merge-with-cross-apply is : 28%, 28% and 44% respectively.

Kent Chenery Kent Chenery 5512 silver badges7 bronze badges · Accepted Answer · 2017-11-14 05:10:01Z

If you only ever have one row were the city and occupation columns are populated you can achieve it with a windowing function:

E.g:

DECLARE @Source TABLE(
 sl_no INT
 ,name NVARCHAR(30)
 ,dob DATETIME2(3)
 ,city NVARCHAR(30)
 ,occupation NVARCHAR(30)
);
INSERT INTO @Source
VALUES
 (1, 'galileo-galilei', '1900-01-01 00:00:00.000', NULL, NULL),
 (2, 'galileo-galilei', '1900-01-02 00:00:00.000', 'venice', NULL),
 (3, 'galileo-galilei', '1900-01-05 00:00:00.000', NULL, 'astronomer'),
 (4, 'issac-newton', '1900-01-01 00:00:00.000', 'london', 'mathematician-scientist')
SELECT DISTINCT
 name
 ,MAX(dob) OVER(PARTITION BY name) AS dob
 ,MAX(city) OVER(PARTITION BY name) AS city
 ,MAX(occupation) OVER(PARTITION BY name) AS occupation
FROM 
 @Source

However, I suspect the reality is you could have multiple records and you always want to return the value from the most recent record that has data in those columns. E.g. if your source was:

DECLARE @Source TABLE(
 sl_no INT
 ,name NVARCHAR(30)
 ,dob DATETIME2(3)
 ,city NVARCHAR(30)
 ,occupation NVARCHAR(30)
);
INSERT INTO @Source
VALUES
 (1, 'galileo-galilei', '1900-01-01 00:00:00.000', 'rome', NULL),
 (2, 'galileo-galilei', '1900-01-02 00:00:00.000', 'venice', NULL),
 (3, 'galileo-galilei', '1900-01-05 00:00:00.000', NULL, 'astronomer'),
 (4, 'issac-newton', '1900-01-01 00:00:00.000', 'london', 'mathematician-scientist')

You could achieve what you want with:

SELECT
 s.name
 ,s.dob
 ,sc.city
 ,so.occupation
FROM
 @Source AS s
 CROSS APPLY(
 SELECT TOP 1 city
 FROM @Source AS s2
 WHERE s2.name = s.name
 AND city IS NOT NULL
 ORDER BY sl_no DESC
 ) AS sc
 CROSS APPLY(
 SELECT TOP 1 occupation
 FROM @Source AS s3
 WHERE s3.name = s.name
 AND occupation IS NOT NULL
 ORDER BY sl_no DESC
 ) AS so
WHERE
 s.sl_no = (SELECT MAX(sl_no) FROM @Source AS s4 WHERE s4.name = s.name)

Wrap that up into a merge or update (I'll do a merge for you) and you will get:

WITH src AS (
 SELECT
 s.name
 ,s.dob
 ,sc.city
 ,so.occupation
 FROM
 @Source AS s
 CROSS APPLY(
 SELECT TOP 1 city
 FROM @Source AS s2
 WHERE s2.name = s.name
 AND city IS NOT NULL
 ORDER BY sl_no DESC
 ) AS sc
 CROSS APPLY(
 SELECT TOP 1 occupation
 FROM @Source AS s3
 WHERE s3.name = s.name
 AND occupation IS NOT NULL
 ORDER BY sl_no DESC
 ) AS so
 WHERE
 s.sl_no = (SELECT MAX(sl_no) FROM @Source AS s4 WHERE s4.name = s.name)
)
MERGE INTO Destination AS tgt
USING tgt.name = src.name
WHEN MATCHED THEN UPDATE
SET dob = src.dob
 ,city = src.city
 ,occupation = src.occupation
WHEN NOT MATCHED THEN INSERT(name, dob, city, occupation)
VALUES(src.name, src.dob, src.city, src.occuptaion);

You are going to want to index the column you join on all the time (name in the above examples) for performance. Otherwise you will get many scans.

You are right - there are indeed more rows with different names in the source table. Cross-apply is new to me. Thanks!
CROSS APPLY is a bit like a for each. For every row in the left hand side apply it to the right hand side. I use that to go get the most recent populated values for a given set in the code above. I use the TOP 1 and ORDER BY to do that.
Did a explain plan on the handful of data that I had for the three approaches above : update-with-join, merge and the merge-with-cross-apply. The breakup was as follows : 36%, 36% and 27%. Thanks again!
Correction : The above breakup was using the select-with-cross-apply and not the merge. The merge has a syntax error and a typo . Once that is fixed (USING src on tgt.name = src.name and src.occupation ) - the cost wise break-up for : update-with-join, merge, merge-with-cross-apply is : 28%, 28% and 44% respectively.

Stack Exchange Network

SQL Server : Merge and update multiple rows across columns

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

SQL Server : Merge and update multiple rows across columns

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions