Suppose there are two lists of categories and then a data table with combinations of those categories, along with various other values.
Suppose that there can only be one instance of any category combination in the data table.
We want a list of every possible category combination along with the values, if any, from the data table.
This is relatively straightforward if we have a list of each set of categories. We can CROSS JOIN the category lists together and LEFT JOIN over to the data.
But what if one of the category lists doesn't separately exist? Rather, the categories only exist as entries in the data table.
One way to handle this is to hit the data table first to get the distinct list of one type of categories, CROSS JOIN that to the other categories, and then LEFT JOIN back into the database table to fill in the values, as before.
For example:
WITH CategoryData AS
(
SELECT
*
FROM (
VALUES
('A', 1, 'bleep'),
('A', 2, 'blorp'),
('B', 2, 'blarp'),
('C', 3, 'blurp')
) AS t1(TextCategory, NumericCategory, Val)
),
NumericCategoryList AS
(
SELECT
*
FROM (
VALUES
(1),
(2),
(3)
) AS t2(NumericCategory)
)
SELECT
TextCategoryList.TextCategory,
NumericCategoryList.NumericCategory,
CategoryData.Val
FROM (
SELECT
DISTINCT
TextCategory
FROM CategoryData
) AS TextCategoryList
CROSS JOIN NumericCategoryList
LEFT JOIN CategoryData
ON CategoryData.TextCategory = TextCategoryList.TextCategory
AND CategoryData.NumericCategory = NumericCategoryList.NumericCategory;
That can be expensive with a large data table.
With date ranges or small number tables, you can generate the Cartesian product of all category combination possibilities less expensively (and with computation only), but if the categories are larger, arbitrary values, that's not practical.
So is there a way for the engine to go over the values of the data table once, generating any missing values by looping over the known categories as it goes?
The categories are independent and the values are arbitrary. Perhaps the textual Category is someone's name, the numeric Category is the ID of the snack type they ate, and the value is the sound of their burp.
Not all people ate all snacks, but I want a list of each person with every possible snack, and their burp sound when they ate that snack (with NULL if they didn't eat that snack). (Let's pretend there are infinitely many possible transcriptions of burp sounds; there isn't a list of burp sounds.)
This involves a third party vendor product. I am hoping there is some fancy window logic which I couldn’t figure out.
2 Answers 2
One neat solution is to create an indexed view which pre-aggregates the list of TextCategory
.
Some caveats with indexed views:
- To use aggregation, you cannot use
DISTINCT
, you must useGROUP BY
and you must have aCOUNT_BIG(*)
column. - The view must be schema-bound, so you can't drop the base columns.
- The only join allowed is an
INNER
, and no derived tables, subqueries, CTEs or TVFs. - It's normally better to access an indexed view using
WITH (NOEXPAND)
, for various reasons. To enforce this, I create a separate, normal view, which selects from the indexed view with that hint. You don't have to do this step, you could putNOEXPAND
directly into your query.
CREATE VIEW dbo.TextCategoryList_Indexed
WITH SCHEMABINDING
AS
SELECT
cd.TextCategory,
COUNT_BIG(*) Count -- must have COUNT_BIG
FROM dbo.CategoryData cd
GROUP BY
cd.TextCategory;
CREATE UNIQUE CLUSTERED INDEX CX_TextCategoryList ON TextCategoryList_Indexed (TextCategory);
Now to make sure to use NOEXPAND
we create another view:
CREATE VIEW dbo.TextCategoryList
AS
SELECT *
FROM dbo.TextCategoryList_Indexed WITH (NOEXPAND)
Then we can run your exact query using it:
SELECT
tc.TextCategory,
nc.NumericCategory,
cd.Val
FROM TextCategoryList tc
CROSS JOIN NumericCategoryList nc
LEFT JOIN CategoryData cd
ON cd.TextCategory = tc.TextCategory
AND cd.NumericCategory = nc.NumericCategory;
And the query plan looks nice and neat
You could probably implement the indexed view with triggers instead, but it won't be as efficient. Your other option is an SQLCLR aggregate function, which you use as a windowed aggregate.
Instead of only filling in the missing data, first ensure you have every possible combination of the categories by doing a CROSS JOIN of the entire data table with the known category list.
You'll end up with a mess of duplicated data, but you'll also have at least one of every combination.
Then, make calculated columns for your data values, setting them to NULL if they're not the "real" values for that combination.
You'll end up with one non-forced-NULL value for each category combination and then a bunch of forced-NULL values for that same combination.
Text Num NumFromList Val CalcVal
A 1 1 bleep bleep
A 1 2 bleep NULL
A 1 3 bleep NULL
A 2 1 blorp NULL
A 2 2 blorp blorp
A 2 3 blorp NULL
...
Now you need to ignore all of the "bad" values for each of the combinations. You can't just exclude all of the NULLs, because then you're back to where you started.
Instead, you need to find a single value for each combination of categories. If there's any non-NULL value, you want that. If there are only NULLs, then you want NULL.
So GROUP BY the categories and choose the MAX value from your calculated field. The excess NULLs will fall away, leaving only good values where they exist and NULLs where they don't.
Here's the SQL using the examples tables:
WITH CategoryData AS
(
SELECT
*
FROM (
VALUES
('A', 1, 'bleep'),
('A', 2, 'blorp'),
('B', 2, 'blarp'),
('C', 3, 'blurp')
) AS t1(TextCategory, NumericCategory, Val)
),
NumericCategoryList AS
(
SELECT
*
FROM (
VALUES
(1),
(2),
(3)
) AS t2(NumericCategory)
)
SELECT
CategoryData.TextCategory,
NumericCategoryList.NumericCategory,
Val = MAX(CalcVals.CalcVal)
FROM CategoryData
CROSS JOIN NumericCategoryList
CROSS APPLY (
SELECT
CalcVal =
CASE
WHEN CategoryData.NumericCategory = NumericCategoryList.NumericCategory THEN
CategoryData.Val
ELSE
NULL
END
) AS CalcVals
GROUP BY CategoryData.TextCategory,
NumericCategoryList.NumericCategory;
Using SSMS to compare the question's example query with the double dip to this version shows a very slight advantage to this version (0.115337 vs 0.115049).
However, I suspect this will be sensitive to the actual table values and quantities of categories.