We have a #ValidCode table with a list of valid codes such as 'A', 'B', 'C', etc. Another table, #SourceData, holds the input data, which arrives as a combination of valid and invalid tokens (sometimes with duplicates).
Ex:
- 'A;B;C' (valid)
- 'A;A;A;A;A;B' (valid)
- 'ad;df;A;B' (invalid)
I'm trying to find an optimal query approach to process these strings and find the valid rows in #SourceData. See the example below:
DROP TABLE IF EXISTS #ValidCode
GO
CREATE TABLE #ValidCode
(
ID INT IDENTITY(1,1)
, Code CHAR(1)
)
INSERT INTO #ValidCode (Code) VALUES ('A'), ('B'), ('C')
GO
DROP TABLE IF EXISTS #SourceData
GO
CREATE TABLE #SourceData
(
ID INT IDENTITY(1,1)
, Codes VARCHAR(500)
, Is_Valid BIT
, Is_Split BIT
)
INSERT INTO #SourceData (Codes)
VALUES ('A;B;C')
, ('B;A')
, ('B;B;B;C;C;A;A;B')
, ('B;Z;1')
, ('B;ss;asd')
SELECT * FROM #ValidCode
SELECT * FROM #SourceData
The query should process the data in the #SourceData table and update the Is_Valid flag, so the rows can be consumed by a subsequent process.
Rules:
- Every token must be valid for the entire row to be valid (rows 1 to 3)
- If even one token is invalid, the entire row is invalid (rows 4 and 5)
So, this is the preferred output:
ID | Codes | Is_Valid |
---|---|---|
1 | A;B;C | 1 |
2 | B;A | 1 |
3 | B;B;B;C;C;A;A;B | 1 |
4 | B;Z;1 | 0 |
5 | B;ss;asd | 0 |
Current approach: loop through each row in #SourceData, split it on the delimiter ';', and compare the tokens to the #ValidCode table. If all tokens are individually valid, mark the row in #SourceData as valid (Is_Valid flag); otherwise mark it as invalid. The WHILE loop approach works, but is slow.
#SourceData could have up to 3 million rows, with each row containing a combination of duplicate valid values ('A;A;A;A') and invalid values ('A;as;sdf;B').
Is there a better approach?
Thanks!
- Storing delimited data in a column is generally a bad decision; a normalized approach to storing the data would save a lot of processing power. – nbk, Nov 19, 2024 at 23:15
- @nbk the source data comes in as delimited. It cannot be changed. – ToC, Nov 20, 2024 at 14:19
2 Answers
One relational way you can do this is by splitting your #SourceData first (fortunately you have access to the STRING_SPLIT() function despite being on an outdated version of SQL Server), then getting the rows that don't match your #ValidCode table, and finally using those rows to determine what Is_Valid should be in the original #SourceData table.
Here's an example of how to do that:
;WITH _BadData AS
(
SELECT DISTINCT SD.ID
FROM #SourceData AS SD
CROSS APPLY STRING_SPLIT(SD.Codes, ';') AS SS
WHERE NOT EXISTS
(
SELECT 1 AS RowExists
FROM #ValidCode AS VC
WHERE SS.[value] = VC.Code
)
)
SELECT
SD.ID,
SD.Codes,
ISNULL(SD.Is_Valid, IIF(BD.ID IS NULL, 1, 0)) AS Is_Valid,
SD.Is_Split
FROM #SourceData AS SD
LEFT JOIN _BadData AS BD
ON SD.ID = BD.ID;
Here's a dbfiddle.uk repo demonstrating that code.
Note: depending on the size of #SourceData and the generated execution plan, you may want to materialize the results of the STRING_SPLIT() function, the whole CTE itself, or both, to a temp table first, before using them in the second half of the above query for the final LEFT JOIN. But I assume this should be measurably better than looping over your rows one by one.
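If materializing does turn out to help, a minimal sketch of that idea might look like the following. It assumes the #SourceData and #ValidCode tables from the question and a SQL Server version and compatibility level where STRING_SPLIT() is available. The temp table name #BadRow is hypothetical, and the UPDATE only fills in Is_Valid where it is still NULL, mirroring the ISNULL() in the query above.

```sql
-- Hypothetical temp table: IDs of rows that contain at least one invalid token.
DROP TABLE IF EXISTS #BadRow;

CREATE TABLE #BadRow
(
    ID INT NOT NULL PRIMARY KEY
);

-- Split every row once and keep only the IDs with an unmatched token.
INSERT INTO #BadRow (ID)
SELECT DISTINCT SD.ID
FROM #SourceData AS SD
CROSS APPLY STRING_SPLIT(SD.Codes, ';') AS SS
WHERE NOT EXISTS
(
    SELECT 1
    FROM #ValidCode AS VC
    WHERE VC.Code = SS.[value]
);

-- Set-based update: a row is valid only if it has no entry in #BadRow.
-- An Is_Valid value that is already set is left as it is.
UPDATE SD
SET Is_Valid = ISNULL(SD.Is_Valid, IIF(BR.ID IS NULL, 1, 0))
FROM #SourceData AS SD
LEFT JOIN #BadRow AS BR
    ON BR.ID = SD.ID;
```

Whether this beats the single-statement CTE depends on the actual plan and data volume, so it is worth comparing both against a representative sample before committing to either.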
- Thanks for the idea, I'll try this too. For each row, I need to preserve the "Is_Valid" flag, i.e. irrespective of whether a row is valid or not, I need to know and preserve that validation result. The CTE above seems to combine all tokens into one table, which may result in not being able to capture the validity flag for each row of #SourceData. – ToC, Nov 20, 2024 at 14:23
- @ToC Gotcha, your sample data didn't include that, so you may want to consider updating your post to clarify that, as I believe you may have the same issue with the other answer that was provided too. Please see my updated answer that preserves the original Is_Valid flag as well. – J.D., Nov 20, 2024 at 15:43
- This is such a creative idea. I cannot believe it works!! I'll keep playing with it to see if it covers all the scenarios. This is a solid foundation for me to add other components. Thanks!! – ToC, Nov 20, 2024 at 20:43
- @ToC Great, no problem! Best of luck! – J.D., Nov 20, 2024 at 22:45
-- first thing that comes to mind:
SELECT
sd.ID
, sd.Codes
, CASE
WHEN NOT EXISTS (
SELECT x.[value]
FROM string_split(sd.Codes, ';') as x
LEFT OUTER JOIN #ValidCode as vc
ON x.[value] = vc.Code
WHERE vc.Code IS NULL
)
THEN 1
ELSE 0
END as is_valid
FROM #SourceData as sd
It might be more optimal to create a child table for the SourceData codes split out into rows.
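A rough sketch of that child-table idea, assuming the same #SourceData and #ValidCode tables from the question; the table name #SourceDataCode and its key are hypothetical. Storing only the distinct tokens per row means a value like 'B;B;B;C;C;A;A;B' is checked as three tokens instead of eight.

```sql
-- Hypothetical child table: one row per distinct token per source row.
DROP TABLE IF EXISTS #SourceDataCode;

CREATE TABLE #SourceDataCode
(
    SourceDataID INT NOT NULL,
    Code VARCHAR(500) NOT NULL,
    PRIMARY KEY (SourceDataID, Code)
);

-- Split once; DISTINCT collapses repeated tokens within the same row.
INSERT INTO #SourceDataCode (SourceDataID, Code)
SELECT DISTINCT sd.ID, x.[value]
FROM #SourceData AS sd
CROSS APPLY STRING_SPLIT(sd.Codes, ';') AS x;

-- A row is invalid if any of its tokens has no match in #ValidCode.
SELECT
    sd.ID
  , sd.Codes
  , CASE
        WHEN EXISTS (
            SELECT 1
            FROM #SourceDataCode AS c
            WHERE c.SourceDataID = sd.ID
              AND NOT EXISTS (
                    SELECT 1
                    FROM #ValidCode AS vc
                    WHERE vc.Code = c.Code
                  )
        )
        THEN 0
        ELSE 1
    END AS is_valid
FROM #SourceData AS sd;
```

The split is paid for once up front, and the child table can then be reused by later steps without calling STRING_SPLIT() again.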
- Interesting approach! I'll try it. – ToC, Nov 19, 2024 at 21:31