SQL Server: Query to parse and validate codes

Question 1

We have a #ValidCode table with a list of valid codes such as 'A', 'B', 'C', etc. Another table, #SourceData, holds the input data, which arrives as a ';'-delimited combination of valid and invalid tokens (sometimes with duplicates).

Ex:

'A;B;C' (valid)
'A;A;A;A;A;B' (valid)
'ad;df;A;B' (invalid)

I'm trying to find an optimal query approach to process these strings and identify the valid rows in #SourceData. See the example below:

DROP TABLE IF EXISTS #ValidCode
GO
CREATE TABLE #ValidCode
(
 ID INT IDENTITY(1,1)
 , Code CHAR(1)
)
INSERT INTO #ValidCode (Code) VALUES ('A'), ('B'), ('C')
GO
DROP TABLE IF EXISTS #SourceData 
GO
CREATE TABLE #SourceData 
(
 ID INT IDENTITY(1,1)
 , Codes VARCHAR(500)
 , Is_Valid BIT
 , Is_Split BIT
)
INSERT INTO #SourceData (Codes) 
VALUES ('A;B;C')
 , ('B;A')
 , ('B;B;B;C;C;A;A;B')
 , ('B;Z;1')
 , ('B;ss;asd')
SELECT * FROM #ValidCode
SELECT * FROM #SourceData

The query should process the data in the #SourceData table and update the Is_Valid flag, so that the rows can be consumed by a subsequent process.

Rules:

Every token must be valid for the entire row to be valid (rows 1 to 3).
If even one token is invalid, the entire row is invalid (rows 4 and 5).

So, this is the preferred output:

ID	Codes	Is_Valid
1	A;B;C	1
2	B;A	1
3	B;B;B;C;C;A;A;B	1
4	B;Z;1	0
5	B;ss;asd	0

Current approach: loop through each row in #SourceData, split its Codes value on the ';' delimiter, then compare the tokens to the #ValidCode table. If all tokens are individually valid, mark the row in #SourceData as valid (Is_Valid flag); otherwise mark it as invalid. The WHILE loop approach works, but it is slow.

#SourceData could have up to 3 million rows, with each row holding any combination of duplicate valid values ('A;A;A;A') and invalid values ('A;as;sdf;B').

Is there a better approach?

Thanks!

Question 2

Storing delimited data in a column is generally a bad decision; a normalized approach to storing the data would save a lot of processing power.

Question 3

@nbk the source data comes in as delimited. It cannot be changed.

Question 4

One relational way you can do this is by splitting your #SourceData first (fortunately you have access to the STRING_SPLIT() function despite being on an outdated version of SQL Server), then getting the rows that don't match your #ValidCode table, and finally using those rows to determine what Is_Valid should be in the original #SourceData table.

Here's an example of how to do that:

;WITH _BadData AS
(
    SELECT DISTINCT SD.ID
    FROM #SourceData AS SD
    CROSS APPLY STRING_SPLIT(SD.Codes, ';') AS SS
    WHERE NOT EXISTS
    (
        SELECT 1 AS RowExists
        FROM #ValidCode AS VC
        WHERE SS.[value] = VC.Code
    )
)
SELECT
    SD.ID,
    SD.Codes,
    ISNULL(SD.Is_Valid, IIF(BD.ID IS NULL, 1, 0)) AS Is_Valid,
    SD.Is_Split
FROM #SourceData AS SD
LEFT JOIN _BadData AS BD
    ON SD.ID = BD.ID;

Here's a dbfiddle.uk repro demonstrating that code.

Note that, depending on the size of #SourceData and the generated execution plan, you may want to materialize the results of the STRING_SPLIT() function, the whole CTE itself, or both, to a temp table first, before using it in the second half of the above query for the final LEFT JOIN. Either way, I'd expect this to be measurably faster than looping over your rows one by one.
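The materialization step, combined with the in-place update the question asks for, might look like the following minimal sketch. It assumes the #ValidCode and #SourceData tables from the question already exist; the #BadData temp table and the IX_BadData_ID index name are hypothetical and not part of the original answer.

```sql
-- Sketch only: #BadData and IX_BadData_ID are hypothetical names.
DROP TABLE IF EXISTS #BadData;

-- Materialize the IDs of rows that contain at least one invalid token.
SELECT DISTINCT SD.ID
INTO #BadData
FROM #SourceData AS SD
CROSS APPLY STRING_SPLIT(SD.Codes, ';') AS SS
WHERE NOT EXISTS
(
    SELECT 1
    FROM #ValidCode AS VC
    WHERE VC.Code = SS.[value]
);

-- Index the materialized IDs so the join below can seek on ID.
CREATE UNIQUE CLUSTERED INDEX IX_BadData_ID ON #BadData (ID);

-- Set the flag in place. Rows that already carry a flag are left untouched,
-- mirroring the ISNULL() in the SELECT version.
UPDATE SD
SET SD.Is_Valid = CASE WHEN BD.ID IS NULL THEN 1 ELSE 0 END
FROM #SourceData AS SD
LEFT JOIN #BadData AS BD
    ON BD.ID = SD.ID
WHERE SD.Is_Valid IS NULL;
```

Whether the temp table actually helps depends on the execution plan, so it is worth comparing against the single-statement CTE version on a realistic data volume.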

Question 5

Thanks for the idea, I'll try this too. For each row, I need to preserve the "Is_Valid" flag, i.e. irrespective of whether a row is valid or not, I need to know and preserve that validation result. The CTE above seems to combine all tokens into one table, which may result in not being able to capture the validity flag for each row of #SourceData.

Question 6

@ToC Gotcha, your sample data didn't include that, so you may want to consider updating your post to clarify that, as I believe you may have the same issue with the other answer that was provided too. Please see my updated answer that preserves the original Is_Valid flag as well.

Question 7

This is such a creative idea. I cannot believe it works! I'll keep playing with it to see if it covers all the scenarios. This is a solid foundation for me to add other components. Thanks!

Question 8

@ToC Great, no problem! Best of luck!

Question 9

-- first thing that comes to mind:

SELECT
    sd.ID
    , sd.Codes
    , CASE
        WHEN NOT EXISTS (
            SELECT x.[value]
            FROM string_split(sd.Codes, ';') as x
            LEFT OUTER JOIN #ValidCode as vc
                ON x.[value] = vc.Code
            WHERE vc.Code IS NULL
        )
        THEN 1
        ELSE 0
    END as is_valid
FROM #SourceData as sd

It might be more optimal to create a child table for the SourceData codes split out into rows.
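A child table along those lines could be sketched as follows. The #SourceDataCode table and its column names are hypothetical, and the sketch assumes the #ValidCode and #SourceData tables from the question. Storing each distinct token once per source row means repeated codes such as 'B;B;B' are checked only once.

```sql
-- Sketch only: #SourceDataCode and its columns are hypothetical names.
DROP TABLE IF EXISTS #SourceDataCode;

CREATE TABLE #SourceDataCode
(
    SourceDataID INT          NOT NULL,
    Code         VARCHAR(500) NOT NULL,
    PRIMARY KEY (SourceDataID, Code)
);

-- One row per distinct token per source row; DISTINCT collapses duplicates.
INSERT INTO #SourceDataCode (SourceDataID, Code)
SELECT DISTINCT sd.ID, x.[value]
FROM #SourceData AS sd
CROSS APPLY STRING_SPLIT(sd.Codes, ';') AS x;

-- A row is invalid when any of its tokens is missing from #ValidCode.
SELECT
    sd.ID
    , sd.Codes
    , CASE
        WHEN EXISTS (
            SELECT 1
            FROM #SourceDataCode AS sdc
            WHERE sdc.SourceDataID = sd.ID
              AND NOT EXISTS (
                  SELECT 1
                  FROM #ValidCode AS vc
                  WHERE vc.Code = sdc.Code
              )
        )
        THEN 0
        ELSE 1
    END AS is_valid
FROM #SourceData AS sd;
```

One caveat with this sketch: a row whose Codes value is NULL produces no child rows and is therefore reported as valid, so it may need an explicit rule if NULLs can occur in the source.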

Question 10

Interesting approach! I'll try it.

J.D. · Accepted Answer · 2024-11-20 13:30:36Z

