I have a table with two columns, and I want to count the distinct values of Col_B for each value of Col_A (i.e. partitioned by Col_A).
MyTable
Col_A | Col_B
A | 1
A | 1
A | 2
A | 2
A | 2
A | 3
b | 4
b | 4
b | 5
Expected Result
Col_A | Col_B | Result
A | 1 | 3
A | 1 | 3
A | 2 | 3
A | 2 | 3
A | 2 | 3
A | 3 | 3
b | 4 | 2
b | 4 | 2
b | 5 | 2
I tried the following code
select *,
count (distinct col_B) over (partition by col_A) as 'Result'
from MyTable
count (distinct col_B) is not working; SQL Server does not allow DISTINCT with the OVER clause. How can I rewrite the query to count distinct values per partition?
5 Answers
This is how I'd do it:
SELECT *
FROM #MyTable AS mt
CROSS APPLY ( SELECT COUNT(DISTINCT mt2.Col_B) AS dc
FROM #MyTable AS mt2
WHERE mt2.Col_A = mt.Col_A
-- GROUP BY mt2.Col_A
) AS ca;
The GROUP BY clause is redundant given the data provided in the question, but may give you a better execution plan. See the follow-up Q & A CROSS APPLY produces outer join.
Consider voting for OVER clause enhancement request - DISTINCT clause for aggregate functions on the feedback site if you would like that feature added to SQL Server.
You can emulate it by using dense_rank, and then picking the maximum rank for each partition:
select col_a, col_b, max(rnk) over (partition by col_a)
from (
select col_a, col_b
, dense_rank() over (partition by col_A order by col_b) as rnk
from #mytable
) as t
You would need to exclude any nulls from col_b to get the same results as COUNT(DISTINCT).
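For example, one way to handle that (my own sketch, not part of the original answer; it relies on SQL Server placing nulls first for an ascending ORDER BY) is to subtract one from the maximum rank whenever the partition contains a null col_b:
select col_a, col_b
, max(rnk) over (partition by col_a)
  -- subtract 1 when the partition contains any null col_b,
  -- because those rows all share dense_rank 1
  - max(case when col_b is null then 1 else 0 end) over (partition by col_a) as distinct_cnt
from (
    select col_a, col_b
    , dense_rank() over (partition by col_A order by col_b) as rnk
    from #mytable
) as t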
This is, in a way, an extension to Lennart's solution, but it is so ugly that I dare not suggest it as an edit. The goal here is to get the results without a derived table. There may never be the need for that, and combined with the ugliness of the query the whole endeavour may seem like a wasted effort. I still wanted to do this as an exercise, though, and would now like to share my result:
SELECT
Col_A,
Col_B,
DistinctCount = DENSE_RANK() OVER (PARTITION BY Col_A ORDER BY Col_B ASC )
+ DENSE_RANK() OVER (PARTITION BY Col_A ORDER BY Col_B DESC)
- 1
- CASE COUNT(Col_B) OVER (PARTITION BY Col_A)
WHEN COUNT( * ) OVER (PARTITION BY Col_A)
THEN 0
ELSE 1
END
FROM
dbo.MyTable
;
The core part of the calculation is this (and I would first of all like to note that the idea is not mine; I learned about this trick elsewhere):
DENSE_RANK() OVER (PARTITION BY Col_A ORDER BY Col_B ASC )
+ DENSE_RANK() OVER (PARTITION BY Col_A ORDER BY Col_B DESC)
- 1
This expression can be used without any change if the values in Col_B are guaranteed to never be null. (It works because, for a row whose Col_B is the k-th smallest of D distinct values in its partition, the ascending rank is k and the descending rank is D - k + 1, so the two ranks always add up to D + 1; subtracting 1 leaves D.) If the column can have nulls, however, you need to account for that, and that is exactly what the CASE expression is there for. It compares the number of rows per partition with the number of non-null Col_B values per partition. If the numbers differ, it means that some rows have a null in Col_B and, therefore, the initial calculation (DENSE_RANK() ... + DENSE_RANK() - 1) needs to be reduced by 1.
Note that because the - 1 is part of the core formula, I chose to leave it like that. However, it can actually be incorporated into the CASE expression, in a futile attempt to make the entire solution look less ugly:
SELECT
Col_A,
Col_B,
DistinctCount = DENSE_RANK() OVER (PARTITION BY Col_A ORDER BY Col_B ASC )
+ DENSE_RANK() OVER (PARTITION BY Col_A ORDER BY Col_B DESC)
- CASE COUNT(Col_B) OVER (PARTITION BY Col_A)
WHEN COUNT( * ) OVER (PARTITION BY Col_A)
THEN 1
ELSE 2
END
FROM
dbo.MyTable
;
This live demo at db<>fiddle.uk can be used to test both variations of the solution.
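If you want to sanity-check the null handling yourself, a minimal addition (mine, not part of the original demo) is to insert one extra row with a null Col_B and re-run either query; every row in group A, including the new one, should still show a DistinctCount of 3, because COUNT(Col_B) then differs from COUNT(*) and the CASE subtracts the extra 1:
-- hypothetical extra row to exercise the CASE branch
INSERT INTO dbo.MyTable (Col_A, Col_B) VALUES ('A', NULL);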
create table #MyTable (
Col_A varchar(5),
Col_B int
)
insert into #MyTable values ('A',1)
insert into #MyTable values ('A',1)
insert into #MyTable values ('A',2)
insert into #MyTable values ('A',2)
insert into #MyTable values ('A',2)
insert into #MyTable values ('A',3)
insert into #MyTable values ('B',4)
insert into #MyTable values ('B',4)
insert into #MyTable values ('B',5)
;with t1 as (
    -- number of distinct Col_B values per Col_A
    select t.Col_A,
           count(*) as cnt
    from (
        -- one row per (Col_A, Col_B) combination
        select Col_A,
               Col_B,
               count(*) as ct
        from #MyTable
        group by Col_A,
                 Col_B
    ) t
    group by t.Col_A
)
select a.*,
       t1.cnt
from #MyTable a
join t1
    on a.Col_A = t1.Col_A
An alternative if, like me, you're mildly allergic to correlated subqueries (Erik Darling's answer) and CTEs (kevinnwhat's answer).
Be aware that when nulls are thrown into the mix, none of these may work the way you would like them to (but it's fairly simple to modify them to taste).
Simple case:
--ignore the existence of nulls
SELECT [mt].*, [Distinct_B].[Distinct_B]
FROM #MyTable AS [mt]
INNER JOIN(
SELECT [Col_A], COUNT(DISTINCT [Col_B]) AS [Distinct_B]
FROM #MyTable
GROUP BY [Col_A]
) AS [Distinct_B] ON
[mt].[Col_A] = [Distinct_B].[Col_A]
;
Same as above, but with comments on what to change for null handling:
--customizable null handling
SELECT [mt].*, [Distinct_B].[Distinct_B]
FROM #MyTable AS [mt]
INNER JOIN(
SELECT
[Col_A],
(
COUNT(DISTINCT [Col_B])
/*
--uncomment if you also want to count Col_B NULL
--as a distinct value
+
MAX(
CASE
WHEN [Col_B] IS NULL
THEN 1
ELSE 0
END
)
*/
)
AS [Distinct_B]
FROM #MyTable
GROUP BY [Col_A]
) AS [Distinct_B] ON
[mt].[Col_A] = [Distinct_B].[Col_A]
/*
--uncomment if you also want to include Col_A when it's NULL
OR
([mt].[Col_A] IS NULL AND [Distinct_B].[Col_A] IS NULL)
*/
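If you want to see both commented-out pieces do something, here are a few hypothetical test rows (my own addition, not from the answer). With both blocks uncommented, group A reports 4 distinct values (the null Col_B counts as one more), and the two rows with a null Col_A join to their own group with Distinct_B = 1:
-- hypothetical rows to exercise the commented-out branches
INSERT INTO #MyTable (Col_A, Col_B) VALUES ('A', NULL);
INSERT INTO #MyTable (Col_A, Col_B) VALUES (NULL, 7);
INSERT INTO #MyTable (Col_A, Col_B) VALUES (NULL, 7);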