The following SQL code keeps, for each combination of id and question, only the rows with the latest date (MAX(date)). I would like to know if there is a simpler or shorter syntax that returns the same result.
with
tbl_src as (select * from `tests2.o1.mc` order by id, date),
tbl_max_date as (
select
id,
question,
MAX(date) as max_date
from
`tests2.o1.mc`
group by
id,
question
)
select
tbl_src.*
from
tbl_src
inner join
tbl_max_date
on
tbl_src.id = tbl_max_date.id
and tbl_src.question = tbl_max_date.question
and tbl_src.date = tbl_max_date.max_date
The original data:
id | date | question | answers |
---|---|---|---|
1 | 2018-03-21 | q1 | "[""n1"",""n3""]" |
1 | 2018-12-10 | q1 | "[""n1"",""n2"",""n3""]" |
1 | 2018-03-21 | q2 | "[""N1"",""n3""]" |
1 | 2018-12-10 | q2 | "[""n1"",""n3""]" |
1 | 2018-03-21 | q3 | "[""N1""]" |
1 | 2018-12-10 | q3 | "[""n2""]" |
2 | 2018-03-29 | q1 | "[""n1"",""n3""]" |
2 | 2018-06-01 | q1 | "[""n1"",""n2"",""n3""]" |
2 | 2018-06-02 | q1 | "[""n1"",""n3""]" |
2 | 2018-06-01 | q2 | "[""n1"",""N2""]" |
2 | 2018-06-01 | q3 | "[""n3""]" |
3 | 2018-03-14 | q1 | "[""n2"",""n3""]" |
3 | 2018-03-26 | q2 | "[""n1""]" |
3 | 2018-03-14 | q3 | "[""n3""]" |
The result:
id | date | question | answers |
---|---|---|---|
1 | 2018-12-10 | q1 | "[""n1"",""n2"",""n3""]" |
1 | 2018-12-10 | q2 | "[""n1"",""n3""]" |
1 | 2018-12-10 | q3 | "[""n2""]" |
2 | 2018-06-02 | q1 | "[""n1"",""n3""]" |
2 | 2018-06-01 | q2 | "[""n1"",""N2""]" |
2 | 2018-06-01 | q3 | "[""n3""]" |
3 | 2018-03-14 | q1 | "[""n2"",""n3""]" |
3 | 2018-03-26 | q2 | "[""n1""]" |
3 | 2018-03-14 | q3 | "[""n3""]" |
2 Answers
You can use ROW_NUMBER to rank your data according to date for each combination of id and question; then simply select the row with a ROW_NUMBER of 1:
WITH tbl_max_date AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY id, question ORDER BY date DESC) AS rn
FROM tests2.o1.mc
)
SELECT *
FROM tbl_max_date
WHERE rn = 1
If you could have more than one row with the same maximum value per group, you can use RANK in place of ROW_NUMBER, as that will give all rows with the same value the same ranking. For example:
WITH tbl_max_date AS (
SELECT *,
RANK() OVER (PARTITION BY id, question ORDER BY date DESC) AS rn
FROM tests2.o1.mc
)
SELECT *
FROM tbl_max_date
WHERE rn = 1
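The difference between the two functions only shows up when a group has a tie on the latest date. The following is a minimal sketch of that case, not BigQuery code: it assumes Python's built-in sqlite3 module (with a bundled SQLite of 3.25 or later, which added window functions) as a stand-in engine, and the table name mc, the helper latest, and the five rows are made up for illustration.

```python
import sqlite3

# Hypothetical miniature of the question's table, with one deliberate tie:
# id 1 / q1 has two rows sharing the latest date.
rows = [
    (1, "2018-03-21", "q1", "a"),
    (1, "2018-12-10", "q1", "b"),
    (1, "2018-12-10", "q1", "c"),
    (2, "2018-06-01", "q1", "d"),
    (2, "2018-06-02", "q1", "e"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mc (id INTEGER, date TEXT, question TEXT, answers TEXT)")
conn.executemany("INSERT INTO mc VALUES (?, ?, ?, ?)", rows)


def latest(window_fn):
    """Return the rows that window_fn ranks first within each (id, question)."""
    # window_fn is interpolated directly; only pass trusted function names.
    sql = f"""
        WITH ranked AS (
            SELECT *,
                   {window_fn}() OVER (PARTITION BY id, question
                                       ORDER BY date DESC) AS rn
            FROM mc
        )
        SELECT id, question, date, answers
        FROM ranked
        WHERE rn = 1
        ORDER BY id, question, answers
    """
    return conn.execute(sql).fetchall()


row_number_result = latest("ROW_NUMBER")  # exactly one row per group
rank_result = latest("RANK")              # every row tied for the latest date

print(len(row_number_result), len(rank_result))  # prints "2 3"
```

With ROW_NUMBER, which of the two tied rows survives for id 1 / q1 is not specified, so if ties are possible and the choice matters, either use RANK or add a tie-breaking column to the ORDER BY.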
I can't speak for Google BigQuery, but in other databases common table expressions impose an optimization boundary and subqueries can perform better; so consider dropping your with.
Is the only purpose of tbl_src to do an order by? It seems so. It's in somewhat of a backwards place, because order by can only be guaranteed to be preserved at the outer level of a query and not after a join, and anything else that works is "by accident".
Try the following:
select *
from (
select id, question, answers, max(date) as max_date
from `tests2.o1.mc`
group by id, question, answers
)
order by id, max_date
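A quick way to see what adding answers to the group by does is to count the groups on the question's sample data. The sketch below rests on several assumptions: it uses Python's built-in sqlite3 module rather than BigQuery, a local table named mc, and the answers values stored as plain JSON-style strings (the CSV quote doubling from the question's table is removed).

```python
import sqlite3

# Sample rows from the question: (id, date, question, answers).
rows = [
    (1, "2018-03-21", "q1", '["n1","n3"]'),
    (1, "2018-12-10", "q1", '["n1","n2","n3"]'),
    (1, "2018-03-21", "q2", '["N1","n3"]'),
    (1, "2018-12-10", "q2", '["n1","n3"]'),
    (1, "2018-03-21", "q3", '["N1"]'),
    (1, "2018-12-10", "q3", '["n2"]'),
    (2, "2018-03-29", "q1", '["n1","n3"]'),
    (2, "2018-06-01", "q1", '["n1","n2","n3"]'),
    (2, "2018-06-02", "q1", '["n1","n3"]'),
    (2, "2018-06-01", "q2", '["n1","N2"]'),
    (2, "2018-06-01", "q3", '["n3"]'),
    (3, "2018-03-14", "q1", '["n2","n3"]'),
    (3, "2018-03-26", "q2", '["n1"]'),
    (3, "2018-03-14", "q3", '["n3"]'),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mc (id INTEGER, date TEXT, question TEXT, answers TEXT)")
conn.executemany("INSERT INTO mc VALUES (?, ?, ?, ?)", rows)

# Grouping key includes answers: one output row per distinct
# (id, question, answers) combination.
per_answer = conn.execute(
    """
    SELECT id, question, answers, MAX(date) AS max_date
    FROM mc
    GROUP BY id, question, answers
    ORDER BY id, max_date
    """
).fetchall()

# Grouping key is only (id, question): one output row per pair.
per_question = conn.execute(
    """
    SELECT id, question, MAX(date) AS max_date
    FROM mc
    GROUP BY id, question
    ORDER BY id, max_date
    """
).fetchall()

print(len(per_answer), len(per_question))  # prints "13 9"
```

On this data the three-column grouping yields 13 rows, because rows with different answers under the same id and question fall into separate groups; only the two-column grouping gives the 9 rows shown in the question's expected result, and it cannot return answers without a join or a window function.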
- Thank you for extremely valuable comments! However the proposed query does not return the last field (answers). The reason why I first grouped and then joined was to return more fields than just id, question and max_date. About ordering, you are correct. One question from my side: when would it be appropriate to use with? Is it cases when I would use the same query more than once? – ZygD, Dec 16, 2020 at 7:44
- I've edited to include answers, which simply needs to be in the group by. Regarding with: the answer is basically "when you can't subquery", which - yes - includes the case where the subquery needs to be reused. – Reinderien, Dec 16, 2020 at 16:15
- Some answers in id-question groups differ, so instead of 9 records I get 13. I guess I should use the previous version + a join. – ZygD, Dec 16, 2020 at 19:01
- Let's discuss in chat.stackexchange.com/rooms/117358/… – Reinderien, Dec 16, 2020 at 19:44