We have a #ValidCode table with a list of valid codes such as 'A', 'B', 'C', etc. Another table, #SourceData, holds the input data: semicolon-delimited strings that combine valid and invalid tokens (sometimes with duplicates).

Ex:

  • 'A;B;C' (valid)
  • 'A;A;A;A;A;B' (valid)
  • 'ad;df;A;B' (invalid)

I'm trying to find an efficient query to process these strings and identify the valid rows in #SourceData. See the example below:

DROP TABLE IF EXISTS #ValidCode
GO
CREATE TABLE #ValidCode
(
 ID INT IDENTITY(1,1)
 , Code CHAR(1)
)
INSERT INTO #ValidCode (Code) VALUES ('A'), ('B'), ('C')
GO
DROP TABLE IF EXISTS #SourceData 
GO
CREATE TABLE #SourceData 
(
 ID INT IDENTITY(1,1)
 , Codes VARCHAR(500)
 , Is_Valid BIT
 , Is_Split BIT
)
INSERT INTO #SourceData (Codes) 
VALUES ('A;B;C')
 , ('B;A')
 , ('B;B;B;C;C;A;A;B')
 , ('B;Z;1')
 , ('B;ss;asd')
SELECT * FROM #ValidCode
SELECT * FROM #SourceData

The query should process the data in the #SourceData table and update the Is_Valid flag, so that the rows can be consumed by a subsequent process.

Rules:

  • Every token must be valid for the whole row to be valid (rows 1 to 3)
  • If even one token is invalid, the whole row is invalid (rows 4 & 5)

So, this is the preferred output:

ID  Codes            Is_Valid
1   A;B;C            1
2   B;A              1
3   B;B;B;C;C;A;A;B  1
4   B;Z;1            0
5   B;ss;asd         0

Current approach: Loop through each row in #SourceData and split them on delimiter ';', then compare them to the #ValidCode table. If all tokens are individually valid, then mark the row in #SourceData as valid (Is_Valid flag). Else mark as invalid. The WHILE loop approach works, but is slow.

#SourceData could have up to 3 million rows, and each row can contain repeated valid tokens ('A;A;A;A') as well as a mix of valid and invalid tokens ('A;as;sdf;B').

Is there a better approach?

Thanks!

asked Nov 19, 2024 at 19:53
  • Storing delimited data in a column is generally a bad decision; a normalized approach to storing the data would save a lot of processing power. Commented Nov 19, 2024 at 23:15
  • @nbk the source data comes in as delimited. It cannot be changed. Commented Nov 20, 2024 at 14:19

2 Answers

One relational way you can do this is by splitting your #SourceData first (fortunately you have access to the STRING_SPLIT() function despite being on an outdated version of SQL Server), then getting the rows that don't match your #ValidCode table, and finally using those rows to determine what Is_Valid should be in the original #SourceData table.

Here's an example of how to do that:

;WITH _BadData AS
(
    SELECT DISTINCT SD.ID
    FROM #SourceData AS SD
    CROSS APPLY STRING_SPLIT(SD.Codes, ';') AS SS
    WHERE NOT EXISTS
    (
        SELECT 1 AS RowExists
        FROM #ValidCode AS VC
        WHERE SS.[value] = VC.Code
    )
)
SELECT
    SD.ID,
    SD.Codes,
    ISNULL(SD.Is_Valid, IIF(BD.ID IS NULL, 1, 0)) AS Is_Valid,
    SD.Is_Split
FROM #SourceData AS SD
LEFT JOIN _BadData AS BD
    ON SD.ID = BD.ID;

Here's a dbfiddle.uk repo demonstrating that code.

Note that, depending on the size of #SourceData and the generated execution plan, you may want to materialize the results of the STRING_SPLIT() function, the whole CTE itself, or both, to a temp table first, before using them in the second half of the above query for the final LEFT JOIN. Either way, I'd expect this to be measurably faster than looping over your rows one by one.
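
As a rough sketch of that materialization idea, the IDs of the bad rows could be written to a temp table first and then used to update the flag in place. The #BadData table and index names are assumptions made for illustration, and whether this beats the single-statement version depends on the actual plan and data volume.

```sql
-- Sketch only: #BadData and IX_BadData_ID are hypothetical names.
DROP TABLE IF EXISTS #BadData;

-- Materialize the IDs of rows that contain at least one unknown token.
SELECT DISTINCT SD.ID
INTO #BadData
FROM #SourceData AS SD
CROSS APPLY STRING_SPLIT(SD.Codes, ';') AS SS
WHERE NOT EXISTS
(
    SELECT 1
    FROM #ValidCode AS VC
    WHERE VC.Code = SS.[value]
);

-- An index on the materialized IDs may help the join below.
CREATE UNIQUE CLUSTERED INDEX IX_BadData_ID ON #BadData (ID);

-- Set the flag only where it is still NULL, mirroring the ISNULL() logic above:
-- 0 if the row appears in #BadData, otherwise 1.
UPDATE SD
SET Is_Valid = IIF(BD.ID IS NULL, 1, 0)
FROM #SourceData AS SD
LEFT JOIN #BadData AS BD
    ON BD.ID = SD.ID
WHERE SD.Is_Valid IS NULL;
```

With the sample data from the question, rows 1 to 3 should end up with Is_Valid = 1 and rows 4 and 5 with Is_Valid = 0. Drop the final WHERE clause if existing flag values should be recomputed rather than preserved.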

answered Nov 20, 2024 at 13:30
  • Thanks for the idea, I'll try this too. For each row, I need to preserve the "Is_Valid" flag i.e. irrespective of if a row is valid or not, I need to know and preserve that validation result. The CTE above seems to combine all tokens into one table -- which may result in not being able to capture the validity flag for each row of #SourceData Commented Nov 20, 2024 at 14:23
  • @ToC Gotcha, your sample data didn't include that, so you may want to consider updating your post to clarify that, as I believe you may have the same issue with the other answer that was provided too. Please see my updated answer that preserves the original Is_Valid flag as well. Commented Nov 20, 2024 at 15:43
  • This is such a creative idea. I cannot believe it works !! I'll keep playing with it to see if it covers all the scenarios. This is a solid foundation for me to add other components. Thanks !! Commented Nov 20, 2024 at 20:43
  • @ToC Great, no problem! Best of luck! Commented Nov 20, 2024 at 22:45

-- first thing that comes to mind:

SELECT
    sd.ID
    , sd.Codes
    , CASE
        WHEN NOT EXISTS (
            SELECT x.[value]
            FROM STRING_SPLIT(sd.Codes, ';') AS x
            LEFT OUTER JOIN #ValidCode AS vc
                ON x.[value] = vc.Code
            WHERE vc.Code IS NULL
        )
        THEN 1
        ELSE 0
      END AS is_valid
FROM #SourceData AS sd

It might be more optimal to create a child table for the SourceData codes split out into rows.
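
One way to read the child-table suggestion, sketched under assumptions: split every row once into a narrow table holding the distinct tokens per source row, key it, and validate against that. #SourceDataCode and its column names are hypothetical and not from the original post.

```sql
-- Sketch only: #SourceDataCode, SourceID and Code are hypothetical names.
DROP TABLE IF EXISTS #SourceDataCode;

CREATE TABLE #SourceDataCode
(
    SourceID INT          NOT NULL,
    Code     VARCHAR(500) NOT NULL,
    PRIMARY KEY CLUSTERED (SourceID, Code)
);

-- DISTINCT collapses duplicates such as 'A;A;A' to a single row per token.
INSERT INTO #SourceDataCode (SourceID, Code)
SELECT DISTINCT SD.ID, SS.[value]
FROM #SourceData AS SD
CROSS APPLY STRING_SPLIT(SD.Codes, ';') AS SS;

-- A row is invalid when at least one of its tokens is missing from #ValidCode.
UPDATE SD
SET Is_Valid = CASE
                   WHEN EXISTS
                   (
                       SELECT 1
                       FROM #SourceDataCode AS SDC
                       WHERE SDC.SourceID = SD.ID
                         AND NOT EXISTS
                             (
                                 SELECT 1
                                 FROM #ValidCode AS VC
                                 WHERE VC.Code = SDC.Code
                             )
                   )
                   THEN 0
                   ELSE 1
               END
FROM #SourceData AS SD;
```

Because duplicates are removed at split time, a row like 'B;B;B;C;C;A;A;B' contributes only three child rows. Note that a row whose Codes value is NULL produces no tokens and would therefore be flagged as valid by this sketch, so that case may need separate handling.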

answered Nov 19, 2024 at 20:44
  • Interesting approach! I'll try it. Commented Nov 19, 2024 at 21:31
