I am trying to split two columns with delimited strings into rows. The positions of the values in each string are related so I am trying to split it out so that the related values are in a row. I am unable to use function as I cannot create objects in the database
Here is sample table and data
CREATE TABLE #temp
(id INT,
keys VARCHAR(50),
vals VARCHAR(50)
);
INSERT INTO #temp
VALUES
(1, '1,2,3', 'one,two,three'),
(2, '4,5,6', 'four,five,six'),
(3, '7,8,9', 'seven,eight,nine');
and my desired output would be
ID key val
1 1 one
1 2 two
1 3 three
2 4 four
2 5 five
2 6 six
3 7 seven
3 8 eight
3 9 nine
I got the query to work if I only split one column, so I define two CTEs with row_number and join on ID and row_number. This does give desired output but my live table is very large and I was hoping for a way to pass through the table only once, instead of twice.
with keys as(
SELECT id,keys,vals,
keys.keyid.value('.', 'VARCHAR(8000)') AS keyid,
row_number() over(order by (select null)) as rn
FROM
(SELECT id,keys,vals,
CAST('<Keys><key>'+REPLACE(keys, ',', '</key><key>')+'</key></Keys>' AS XML) AS tempkeys
FROM #temp
) AS temp
CROSS APPLY tempkeys.nodes('/Keys/key') AS keys(keyid)),
vals as(
SELECT id,keys,vals,
vals.val.value('.', 'VARCHAR(8000)') AS valid,
row_number() over(order by (select null)) as rn
FROM
(SELECT id,keys,vals,
CAST('<vals><val>'+REPLACE(vals, ',', '</val><val>')+'</val></vals>' AS XML) AS tempvals
FROM #temp
) AS temp
CROSS APPLY tempvals.nodes('/vals/val') AS vals(val))
SELECT k.id, k.keyid, v.valid
FROM keys AS k
INNER JOIN vals AS v
ON k.id = v.id
AND k.rn = v.rn;
-
2Why are you not allowed to create objects in the database? Can't you ask someone to do this for you? There are tons of good functions out there that have been proven to be about as efficient as you're going to be able to do this kind of thing (without fixing the design). So what is the actual roadblock, and can you work on that?Aaron Bertrand– Aaron Bertrand2018年01月09日 19:44:27 +00:00Commented Jan 9, 2018 at 19:44
-
@AaronBertrand I have the ability to create functions but the database is a dynamicsCRM and schema changes in the database are not supported. I am not sure whether a function would also violate the supportability of the database so we just dont make changes to it. My other thought would have been to create a database with just a split function, or put it in master/msdb and just run query with split function that way. Wouldn't I have same issue having to apply function to the table twice for each column?Bob Klimes– Bob Klimes2018年01月09日 19:55:43 +00:00Commented Jan 9, 2018 at 19:55
-
1I don't see how a basic string splitting function could violate supportability in any way, or why it couldn't be in any other database. If you are open to a function (regardless of where it lives), I am fairly confident that we can do it in a tidier and hopefully more efficient way than your current approach. My only concern is what to do when the number of keys and the number of values doesn't match,Aaron Bertrand– Aaron Bertrand2018年01月09日 19:58:48 +00:00Commented Jan 9, 2018 at 19:58
-
Put the function into another database, call it via a 3 part object name. Your constraint is quite arbitrary: massively complex code with XML instead of a tidy CLR or numbers table functiongbn– gbn2018年01月09日 20:31:40 +00:00Commented Jan 9, 2018 at 20:31
2 Answers 2
Create the function in msdb
or somewhere else.
CREATE FUNCTION dbo.SplitTwoStringsWithSameOrder
(
@List1 varchar(50),
@List2 varchar(50),
@Delim varchar(10)
)
RETURNS TABLE
AS
RETURN
(
WITH src(r) AS
(
SELECT 1 UNION ALL SELECT r + 1 FROM src WHERE r < 10
),
Numbers(Number) AS
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM src AS s1, src AS s2 -- add more if you need longer strings
),
parsed(s1,s2,r1,r2)
AS
(
SELECT
SUBSTRING(@List1, n1.Number, CHARINDEX(@Delim, @List1
+ @Delim, n1.Number) - n1.Number),
SUBSTRING(@List2, n2.Number, CHARINDEX(@Delim, @List2
+ @Delim, n2.Number) - n2.Number),
r1 = ROW_NUMBER() OVER (ORDER BY n1.Number),
r2 = ROW_NUMBER() OVER (ORDER BY n2.Number)
FROM Numbers AS n1, Numbers AS n2
ON n1.Number <= LEN(@List1)
AND n2.Number <= LEN(@List2)
AND SUBSTRING(@Delim + @List1, n1.Number, LEN(@Delim)) = @Delim
AND SUBSTRING(@Delim + @List2, n2.Number, LEN(@Delim)) = @Delim
)
SELECT s1, s2, r1, r2 FROM parsed WHERE r1 = r2
);
Then, as @gbn noted, reference it by 3-part name wherever your query has to run.
CREATE TABLE #temp
(id INT,
keys VARCHAR(50),
vals VARCHAR(50)
);
INSERT INTO #temp
VALUES
(1, '1,2,3', 'one,two,three'),
(2, '4,5,6', 'four,five,six'),
(3, '7,8,9', 'seven,eight,nine');
SELECT t.id, f.s1, f.s2 FROM #temp AS t
CROSS APPLY msdb.dbo.SplitTwoStringsWithSameOrder(keys, vals, ',') AS f
ORDER BY t.id, f.r1;
GO
DROP TABLE #temp;
Results:
The resulting plan, shown in Plan Explorer (disclaimer: I'm the Product Manager), is not the prettiest thing I've ever seen (click to enlarge a little bit):
But there is exactly one scan of #temp (4% cost). The biggest costs are two sorts and a spool, and there is some I/O due to a worktable which I am not sure is avoidable.
If you KNOW you will only ever have 50 characters in either of these strings, then you can get a much simpler plan with a built-in Numbers
table (people object to these, but they're very useful, and they are almost always in memory if you reference them enough). This doesn't help I/O but removing the recursive CTE and other constructs of building the numbers inside the function is quite helpful for CPU etc.
First, the numbers table:
DROP TABLE dbo.Numbers;
;WITH n AS
(
SELECT
TOP (50) rn = ROW_NUMBER() OVER
(ORDER BY [object_id])
FROM sys.all_columns
ORDER BY [object_id]
)
SELECT [Number] = rn - 1
INTO dbo.Numbers
FROM n;
CREATE UNIQUE CLUSTERED INDEX n ON dbo.Numbers([Number]);
Then a second version of the function:
CREATE FUNCTION dbo.SplitTwoStringsWithSameOrder2
(
@List1 varchar(50),
@List2 varchar(50),
@Delim nvarchar(10)
)
RETURNS TABLE
AS
RETURN
(
WITH parsed(s1,s2,r1,r2)
AS
(
SELECT
SUBSTRING(@List1, n1.Number, CHARINDEX(@Delim, @List1
+ @Delim, n1.Number) - n1.Number),
SUBSTRING(@List2, n2.Number, CHARINDEX(@Delim, @List2
+ @Delim, n2.Number) - n2.Number),
r1 = ROW_NUMBER() OVER (ORDER BY n1.Number),
r2 = ROW_NUMBER() OVER (ORDER BY n2.Number)
FROM dbo.Numbers AS n1
INNER JOIN dbo.Numbers AS n2
ON n1.Number <= LEN(@List1)
AND n2.Number <= LEN(@List2)
AND SUBSTRING(@Delim + @List1, n1.Number, LEN(@Delim)) = @Delim
AND SUBSTRING(@Delim + @List2, n2.Number, LEN(@Delim)) = @Delim
)
SELECT s1, s2, r1, r2 FROM parsed WHERE r1 = r2
);
GO
Here is the simpler plan that results (again, click to enlarge):
The plan still has two sort operations, but the spool is gone, there is still only one scan of #temp
, and in my limited tests the cost numbers (absolute cost numbers, not %) were better every time.
I don't know precisely either of these will scale with a lot more rows, but it's worth testing, and if you weigh this against other solutions and it can't scale well, that may be a point your reconsider the design (store these relationally instead of as comma-separated sets).
I have stepped into the same problem with a large numbers of rows and four list columns.
The previous solutions doesn`t workout for me.
The solution from @AaronBertrand has a problem with a different number of elements in the list. The problem can be solved by adding a partition by on the ROW_NUMBER:
r1 = ROW_NUMBER() OVER (PARTITION BY n2.Number ORDER BY n1.Number)
However still doesn't workout for me because of my large numbers of rows and elements.
I created the following script to solve my problem without using a function:
DROP TABLE IF EXISTS #temp
CREATE TABLE #temp
(
id INT,
keys VARCHAR(4000),
vals VARCHAR(4000)
);
INSERT INTO #temp
VALUES
(1, '1,2,3', 'one,two,three'),
(2, '4,5,6', 'four,five,six'),
(3, '7,8,9', 'seven,eight,nine');
DECLARE @delim VARCHAR(4000) = ',';
WITH split AS (
SELECT
id
,CONVERT(VARCHAR(4000), CONCAT(keys, @delim)) AS keys
,CONVERT(VARCHAR(4000), CONCAT(vals, @delim)) AS vals
,1 AS iniciokeys
,COALESCE(NULLIF(CHARINDEX(@delim, keys, 1), 0), LEN(keys)) AS fimkeys
,CONVERT(VARCHAR(4000), RTRIM(LTRIM(SUBSTRING(keys, 1, COALESCE(NULLIF(CHARINDEX(@delim, keys, 1), 0), LEN(keys)) - 1)))) AS vkeys
,1 AS iniciovals
,COALESCE(NULLIF(CHARINDEX(@delim, vals, 1), 0), LEN(vals)) AS fimvals
,CONVERT(VARCHAR(4000), RTRIM(LTRIM(SUBSTRING(vals, 1, COALESCE(NULLIF(CHARINDEX(@delim, vals, 1), 0), LEN(vals)) - 1)))) AS vvals
FROM #temp
WHERE LEN(keys) > 0
AND LEN(vals) > 0
UNION ALL
SELECT
id
,CONVERT(VARCHAR(4000), keys) AS keys
,CONVERT(VARCHAR(4000), vals) AS vals
,CONVERT(INT, fimkeys) + 1 AS iniciokeys
,COALESCE(NULLIF(CHARINDEX(@delim, keys, fimkeys + 1), 0), LEN(keys)) AS fimkeys
,CONVERT(VARCHAR(4000), RTRIM(LTRIM(SUBSTRING(keys, fimkeys + 1, COALESCE(NULLIF(CHARINDEX(@delim, keys, fimkeys + 1), 0), LEN(keys))-fimkeys-1)))) AS vkeys
,CONVERT(INT, fimvals) + 1 AS iniciovals
,COALESCE(NULLIF(CHARINDEX(@delim, vals, fimvals + 1), 0), LEN(vals)) AS fimvals
,CONVERT(VARCHAR(4000), RTRIM(LTRIM(SUBSTRING(vals, fimvals + 1, COALESCE(NULLIF(CHARINDEX(@delim, vals, fimvals + 1), 0), LEN(vals))-fimvals-1)))) AS vvals
FROM split
WHERE fimkeys < LEN(keys)
AND fimvals < LEN(vals)
)
SELECT
id
,vkeys
,vvals
FROM split
ORDER BY id
,vkeys
OPTION(MAXRECURSION 32767)
Results:
Query plan:
As you can see, the query plan is very simple and have only one table scan on #temp.
The solution is also quite scalable.
Explore related questions
See similar questions with these tags.