Fill in missing values without double dip to data table

Question 1

Suppose there are two lists of categories and then a data table with combinations of those categories, along with various other values.

Suppose that there can only be one instance of any category combination in the data table.

We want a list of every possible category combination along with the values, if any, from the data table.

This is relatively straightforward if we have a list of each set of categories. We can CROSS JOIN the category lists together and LEFT JOIN over to the data.

But what if one of the category lists doesn't separately exist? Rather, the categories only exist as entries in the data table.

One way to handle this is to hit the data table first to get the distinct list of one type of categories, CROSS JOIN that to the other categories, and then LEFT JOIN back into the database table to fill in the values, as before.

For example:

WITH CategoryData AS
(
 SELECT
 *
 FROM (
 VALUES
 ('A', 1, 'bleep'),
 ('A', 2, 'blorp'),
 ('B', 2, 'blarp'),
 ('C', 3, 'blurp')
 ) AS t1(TextCategory, NumericCategory, Val)
),
NumericCategoryList AS
(
 SELECT
 *
 FROM (
 VALUES
 (1),
 (2),
 (3)
 ) AS t2(NumericCategory)
)
SELECT
 TextCategoryList.TextCategory,
 NumericCategoryList.NumericCategory,
 CategoryData.Val
FROM (
 SELECT
 DISTINCT
 TextCategory
 FROM CategoryData
 ) AS TextCategoryList
CROSS JOIN NumericCategoryList
LEFT JOIN CategoryData
ON CategoryData.TextCategory = TextCategoryList.TextCategory
AND CategoryData.NumericCategory = NumericCategoryList.NumericCategory;

That can be expensive with a large data table.

With date ranges or small number tables, you can generate the Cartesian product of all category combination possibilities less expensively (and with computation only), but if the categories are larger, arbitrary values, that's not practical.

So is there a way for the engine to go over the values of the data table once, generating any missing values by looping over the known categories as it goes?

The categories are independent and the values are arbitrary. Perhaps the textual Category is someone's name, the numeric Category is the ID of the snack type they ate, and the value is the sound of their burp.

Not all people ate all snacks, but I want a list of each person with every possible snack, and their burp sound when they ate that snack (with NULL if they didn't eat that snack). (Let's pretend there are infinitely many possible transcriptions of burp sounds; there isn't a list of burp sounds.)

This involves a third party vendor product. I am hoping there is some fancy window logic which I couldn’t figure out.

Question 2

One neat solution is to create an indexed view which pre-aggregates the list of TextCategory.

Some caveats with indexed views:

To use aggregation, you cannot use DISTINCT, you must use GROUP BY and you must have a COUNT_BIG(*) column.
The view must be schema-bound, so you can't drop the base columns.
The only join allowed is an INNER, and no derived tables, subqueries, CTEs or TVFs.
It's normally better to access an indexed view using WITH (NOEXPAND), for various reasons. To enforce this, I create a separate, normal view, which selects from the indexed view with that hint. You don't have to do this step, you could put NOEXPAND directly into your query.

CREATE VIEW dbo.TextCategoryList_Indexed
WITH SCHEMABINDING
AS
SELECT
 cd.TextCategory,
 COUNT_BIG(*) Count -- must have COUNT_BIG
FROM dbo.CategoryData cd
GROUP BY
 cd.TextCategory;

CREATE UNIQUE CLUSTERED INDEX CX_TextCategoryList ON TextCategoryList_Indexed (TextCategory);

Now to make sure to use NOEXPAND we create another view:

CREATE VIEW dbo.TextCategoryList
AS
SELECT *
FROM dbo.TextCategoryList_Indexed WITH (NOEXPAND)

Then we can run your exact query using it:

SELECT
 tc.TextCategory,
 nc.NumericCategory,
 cd.Val
FROM TextCategoryList tc
CROSS JOIN NumericCategoryList nc
LEFT JOIN CategoryData cd
 ON cd.TextCategory = tc.TextCategory
 AND cd.NumericCategory = nc.NumericCategory;

db<>fiddle

And the query plan looks nice and neat

Query Plan

You could probably implement the indexed view with triggers instead, but it won't be as efficient. Your other option is an SQLCLR aggregate function, which you use as a windowed aggregate.

Question 3

Instead of only filling in the missing data, first ensure you have every possible combination of the categories by doing a CROSS JOIN of the entire data table with the known category list.

You'll end up with a mess of duplicated data, but you'll also have at least one of every combination.

Then, make calculated columns for your data values, setting them to NULL if they're not the "real" values for that combination.

You'll end up with one non-forced-NULL value for each category combination and then a bunch of forced-NULL values for that same combination.

Text Num NumFromList Val CalcVal
A 1 1 bleep bleep
A 1 2 bleep NULL
A 1 3 bleep NULL
A 2 1 blorp NULL
A 2 2 blorp blorp
A 2 3 blorp NULL
...

Now you need to ignore all of the "bad" values for each of the combinations. You can't just exclude all of the NULLs, because then you're back to where you started.

Instead, you need to find a single value for each combination of categories. If there's any non-NULL value, you want that. If there are only NULLs, then you want NULL.

So GROUP BY the categories and choose the MAX value from your calculated field. The excess NULLs will fall away, leaving only good values where they exist and NULLs where they don't.

Here's the SQL using the examples tables:

WITH CategoryData AS
(
 SELECT
 *
 FROM (
 VALUES
 ('A', 1, 'bleep'),
 ('A', 2, 'blorp'),
 ('B', 2, 'blarp'),
 ('C', 3, 'blurp')
 ) AS t1(TextCategory, NumericCategory, Val)
),
NumericCategoryList AS
(
 SELECT
 *
 FROM (
 VALUES
 (1),
 (2),
 (3)
 ) AS t2(NumericCategory)
)
SELECT
 CategoryData.TextCategory,
 NumericCategoryList.NumericCategory,
 Val = MAX(CalcVals.CalcVal)
FROM CategoryData
CROSS JOIN NumericCategoryList
CROSS APPLY (
 SELECT
 CalcVal = 
 CASE
 WHEN CategoryData.NumericCategory = NumericCategoryList.NumericCategory THEN 
 CategoryData.Val
 ELSE
 NULL
 END
 ) AS CalcVals
GROUP BY CategoryData.TextCategory,
 NumericCategoryList.NumericCategory;

Using SSMS to compare the question's example query with the double dip to this version shows a very slight advantage to this version (0.115337 vs 0.115049).

Query plan comparison showing more complexity with the question's example (top) than the answer's query (bottom)

However, I suspect this will be sensitive to the actual table values and quantities of categories.

Charlieface Charlieface 17.8k22 silver badges45 bronze badges · Answer 1 · 2022-06-14 19:55:16Z

One neat solution is to create an indexed view which pre-aggregates the list of TextCategory.

Some caveats with indexed views:

To use aggregation, you cannot use DISTINCT, you must use GROUP BY and you must have a COUNT_BIG(*) column.
The view must be schema-bound, so you can't drop the base columns.
The only join allowed is an INNER, and no derived tables, subqueries, CTEs or TVFs.
It's normally better to access an indexed view using WITH (NOEXPAND), for various reasons. To enforce this, I create a separate, normal view, which selects from the indexed view with that hint. You don't have to do this step, you could put NOEXPAND directly into your query.

CREATE VIEW dbo.TextCategoryList_Indexed
WITH SCHEMABINDING
AS
SELECT
 cd.TextCategory,
 COUNT_BIG(*) Count -- must have COUNT_BIG
FROM dbo.CategoryData cd
GROUP BY
 cd.TextCategory;

CREATE UNIQUE CLUSTERED INDEX CX_TextCategoryList ON TextCategoryList_Indexed (TextCategory);

Now to make sure to use NOEXPAND we create another view:

CREATE VIEW dbo.TextCategoryList
AS
SELECT *
FROM dbo.TextCategoryList_Indexed WITH (NOEXPAND)

Then we can run your exact query using it:

SELECT
 tc.TextCategory,
 nc.NumericCategory,
 cd.Val
FROM TextCategoryList tc
CROSS JOIN NumericCategoryList nc
LEFT JOIN CategoryData cd
 ON cd.TextCategory = tc.TextCategory
 AND cd.NumericCategory = nc.NumericCategory;

db<>fiddle

And the query plan looks nice and neat

Query Plan

You could probably implement the indexed view with triggers instead, but it won't be as efficient. Your other option is an SQLCLR aggregate function, which you use as a windowed aggregate.

Riley Major Riley Major 1,9654 gold badges19 silver badges37 bronze badges · Answer 2 · 2022-06-13 21:40:57Z

Instead of only filling in the missing data, first ensure you have every possible combination of the categories by doing a CROSS JOIN of the entire data table with the known category list.

You'll end up with a mess of duplicated data, but you'll also have at least one of every combination.

Then, make calculated columns for your data values, setting them to NULL if they're not the "real" values for that combination.

You'll end up with one non-forced-NULL value for each category combination and then a bunch of forced-NULL values for that same combination.

Text Num NumFromList Val CalcVal
A 1 1 bleep bleep
A 1 2 bleep NULL
A 1 3 bleep NULL
A 2 1 blorp NULL
A 2 2 blorp blorp
A 2 3 blorp NULL
...

Now you need to ignore all of the "bad" values for each of the combinations. You can't just exclude all of the NULLs, because then you're back to where you started.

Instead, you need to find a single value for each combination of categories. If there's any non-NULL value, you want that. If there are only NULLs, then you want NULL.

So GROUP BY the categories and choose the MAX value from your calculated field. The excess NULLs will fall away, leaving only good values where they exist and NULLs where they don't.

Here's the SQL using the examples tables:

WITH CategoryData AS
(
 SELECT
 *
 FROM (
 VALUES
 ('A', 1, 'bleep'),
 ('A', 2, 'blorp'),
 ('B', 2, 'blarp'),
 ('C', 3, 'blurp')
 ) AS t1(TextCategory, NumericCategory, Val)
),
NumericCategoryList AS
(
 SELECT
 *
 FROM (
 VALUES
 (1),
 (2),
 (3)
 ) AS t2(NumericCategory)
)
SELECT
 CategoryData.TextCategory,
 NumericCategoryList.NumericCategory,
 Val = MAX(CalcVals.CalcVal)
FROM CategoryData
CROSS JOIN NumericCategoryList
CROSS APPLY (
 SELECT
 CalcVal = 
 CASE
 WHEN CategoryData.NumericCategory = NumericCategoryList.NumericCategory THEN 
 CategoryData.Val
 ELSE
 NULL
 END
 ) AS CalcVals
GROUP BY CategoryData.TextCategory,
 NumericCategoryList.NumericCategory;

Using SSMS to compare the question's example query with the double dip to this version shows a very slight advantage to this version (0.115337 vs 0.115049).

Query plan comparison showing more complexity with the question's example (top) than the answer's query (bottom)

However, I suspect this will be sensitive to the actual table values and quantities of categories.

Stack Exchange Network

Fill in missing values without double dip to data table

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Fill in missing values without double dip to data table

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions