Keeping only maximum date rows in a group

Question 1

The following SQL code keeps only the rows with the MAX(date) for each combination of id and question. Is there a simpler or shorter syntax that returns the same result?

with 
tbl_src as (select * from `tests2.o1.mc` order by id, date),
tbl_max_date as (
 select
 id,
 question,
 MAX(date) as max_date
 from
 `tests2.o1.mc`
 group by
 id,
 question
)
select 
 tbl_src.*
from
 tbl_src
inner join
 tbl_max_date
on
 tbl_src.id = tbl_max_date.id
 and tbl_src.question = tbl_max_date.question
 and tbl_src.date = tbl_max_date.max_date

The original data:

id	date	question	answers
1	2018-03-21	q1	"[""n1"",""n3""]"
1	2018-12-10	q1	"[""n1"",""n2"",""n3""]"
1	2018-03-21	q2	"[""N1"",""n3""]"
1	2018-12-10	q2	"[""n1"",""n3""]"
1	2018-03-21	q3	"[""N1""]"
1	2018-12-10	q3	"[""n2""]"
2	2018-03-29	q1	"[""n1"",""n3""]"
2	2018-06-01	q1	"[""n1"",""n2"",""n3""]"
2	2018-06-02	q1	"[""n1"",""n3""]"
2	2018-06-01	q2	"[""n1"",""N2""]"
2	2018-06-01	q3	"[""n3""]"
3	2018-03-14	q1	"[""n2"",""n3""]"
3	2018-03-26	q2	"[""n1""]"
3	2018-03-14	q3	"[""n3""]"

The result:

id	date	question	answers
1	2018-12-10	q1	"[""n1"",""n2"",""n3""]"
1	2018-12-10	q2	"[""n1"",""n3""]"
1	2018-12-10	q3	"[""n2""]"
2	2018-06-02	q1	"[""n1"",""n3""]"
2	2018-06-01	q2	"[""n1"",""N2""]"
2	2018-06-01	q3	"[""n3""]"
3	2018-03-14	q1	"[""n2"",""n3""]"
3	2018-03-26	q2	"[""n1""]"
3	2018-03-14	q3	"[""n3""]"

Question 3

I can't speak for Google BigQuery, but in some other databases (PostgreSQL before version 12, for example) a common table expression acts as an optimization boundary, so an equivalent subquery can perform better. Consider dropping your with clauses.

Is the only purpose of tbl_src to apply an order by? It seems so. That is the wrong place for it: ordering is only guaranteed when order by appears at the outermost level of a query. It is not guaranteed to survive a join, and any ordering you happen to see after one is "by accident".

Try the following:

select *
from (
 select id, question, answers, max(date) as max_date
 from `tests2.o1.mc`
 group by id, question, answers
)
order by id, max_date

Question 4

Thank you for the extremely valuable comments! However, the proposed query does not return the last field (answers). The reason I first grouped and then joined was to return more fields than just id, question and max_date. About the ordering, you are correct. One question from my side: when would it be appropriate to use with? Is it for cases where I would use the same query more than once?

Question 5

I've edited the query to include answers, which simply needs to be added to the group by. Regarding with: the answer is basically "when you can't use a subquery", which does include the case where the subquery needs to be reused.

Question 6

Some answers in id-question groups differ, so instead of 9 records I get 13. I guess I should use the previous version + a join.
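
To make the 9 versus 13 discrepancy concrete, here is a minimal sketch that loads the 14 sample rows from the question into an in-memory SQLite database and runs both variants. SQLite is only a stand-in for BigQuery here, and the table name mc and the ISO-formatted date strings are assumptions made for the illustration. It also shows the group-then-join approach written with a subquery instead of with.

```python
import sqlite3

# Sample rows copied from the question: (id, date, question, answers).
ROWS = [
    (1, "2018-03-21", "q1", '["n1","n3"]'),
    (1, "2018-12-10", "q1", '["n1","n2","n3"]'),
    (1, "2018-03-21", "q2", '["N1","n3"]'),
    (1, "2018-12-10", "q2", '["n1","n3"]'),
    (1, "2018-03-21", "q3", '["N1"]'),
    (1, "2018-12-10", "q3", '["n2"]'),
    (2, "2018-03-29", "q1", '["n1","n3"]'),
    (2, "2018-06-01", "q1", '["n1","n2","n3"]'),
    (2, "2018-06-02", "q1", '["n1","n3"]'),
    (2, "2018-06-01", "q2", '["n1","N2"]'),
    (2, "2018-06-01", "q3", '["n3"]'),
    (3, "2018-03-14", "q1", '["n2","n3"]'),
    (3, "2018-03-26", "q2", '["n1"]'),
    (3, "2018-03-14", "q3", '["n3"]'),
]

# In-memory SQLite table standing in for `tests2.o1.mc` (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mc (id INTEGER, date TEXT, question TEXT, answers TEXT)")
conn.executemany("INSERT INTO mc VALUES (?, ?, ?, ?)", ROWS)

# Variant 1: answers is part of the grouping key, so every distinct
# (id, question, answers) combination becomes its own group.
grouped_with_answers = conn.execute("""
    SELECT id, question, answers, MAX(date) AS max_date
    FROM mc
    GROUP BY id, question, answers
""").fetchall()

# Variant 2: group on (id, question) only, then join back to fetch answers.
# ORDER BY sits at the outermost level, where ordering is guaranteed.
joined = conn.execute("""
    SELECT src.id, src.date, src.question, src.answers
    FROM mc AS src
    INNER JOIN (
        SELECT id, question, MAX(date) AS max_date
        FROM mc
        GROUP BY id, question
    ) AS latest
      ON src.id = latest.id
     AND src.question = latest.question
     AND src.date = latest.max_date
    ORDER BY src.id, src.question
""").fetchall()

print(len(grouped_with_answers), len(joined))
```

On this sample the first variant yields one row per distinct answers value rather than one row per id-question pair, which is why the join (or a window function) is still needed to bring answers along.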

Question 7

Let's discuss in chat.stackexchange.com/rooms/117358/…

Nick · Accepted Answer · 2020-12-16 10:35:14Z

You can use ROW_NUMBER to rank your data according to date for each combination of id and question; then simply select the row with a ROW_NUMBER of 1:

WITH tbl_max_date AS (
 SELECT *,
 ROW_NUMBER() OVER (PARTITION BY id, question ORDER BY date DESC) AS rn
 FROM tests2.o1.mc
)
SELECT *
FROM tbl_max_date
WHERE rn = 1
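
As a rough check of the ROW_NUMBER approach, the sketch below runs the same query shape against the sample rows from the question. It uses an in-memory SQLite database as a stand-in for BigQuery (window functions need SQLite 3.25 or later), and the table name mc and the ISO-formatted date strings are assumptions for the illustration.

```python
import sqlite3

# Sample rows copied from the question: (id, date, question, answers).
ROWS = [
    (1, "2018-03-21", "q1", '["n1","n3"]'),
    (1, "2018-12-10", "q1", '["n1","n2","n3"]'),
    (1, "2018-03-21", "q2", '["N1","n3"]'),
    (1, "2018-12-10", "q2", '["n1","n3"]'),
    (1, "2018-03-21", "q3", '["N1"]'),
    (1, "2018-12-10", "q3", '["n2"]'),
    (2, "2018-03-29", "q1", '["n1","n3"]'),
    (2, "2018-06-01", "q1", '["n1","n2","n3"]'),
    (2, "2018-06-02", "q1", '["n1","n3"]'),
    (2, "2018-06-01", "q2", '["n1","N2"]'),
    (2, "2018-06-01", "q3", '["n3"]'),
    (3, "2018-03-14", "q1", '["n2","n3"]'),
    (3, "2018-03-26", "q2", '["n1"]'),
    (3, "2018-03-14", "q3", '["n3"]'),
]

# In-memory SQLite table standing in for `tests2.o1.mc` (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mc (id INTEGER, date TEXT, question TEXT, answers TEXT)")
conn.executemany("INSERT INTO mc VALUES (?, ?, ?, ?)", ROWS)

# Number the rows of each (id, question) group from newest to oldest,
# then keep only the newest row of every group.
latest = conn.execute("""
    WITH ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY id, question ORDER BY date DESC) AS rn
        FROM mc
    )
    SELECT id, date, question, answers
    FROM ranked
    WHERE rn = 1
    ORDER BY id, question
""").fetchall()

for row in latest:
    print(row)
```

The outer SELECT lists the columns explicitly so that the helper column rn does not appear in the output; with SELECT * it would be returned as an extra column.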

If you could have more than one row with the same maximum value per group, you can use RANK in place of ROW_NUMBER, as that will give all rows with the same value the same ranking. For example:

WITH tbl_max_date AS (
 SELECT *,
 RANK() OVER (PARTITION BY id, question ORDER BY date DESC) AS rn
 FROM tests2.o1.mc
)
SELECT *
FROM tbl_max_date
WHERE rn = 1
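
The sample data in the question has no ties, so the difference between the two functions does not show up there. The sketch below uses three hypothetical rows, two of which share the maximum date, to illustrate it. As before, an in-memory SQLite database (3.25 or later) stands in for BigQuery, and the table name mc and the helper top_rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mc (id INTEGER, date TEXT, question TEXT, answers TEXT)")
conn.executemany(
    "INSERT INTO mc VALUES (?, ?, ?, ?)",
    [
        (1, "2018-03-21", "q1", '["n1"]'),
        (1, "2018-12-10", "q1", '["n2"]'),
        (1, "2018-12-10", "q1", '["n3"]'),  # hypothetical tie on the maximum date
    ],
)


def top_rows(ranking_function):
    """Return the rows ranked first per (id, question) by the given function.

    ranking_function is interpolated into the SQL text, so it must only ever
    be one of the fixed names used below, never user input.
    """
    query = f"""
        WITH ranked AS (
            SELECT *,
                   {ranking_function}() OVER (
                       PARTITION BY id, question ORDER BY date DESC
                   ) AS rn
            FROM mc
        )
        SELECT id, date, question, answers
        FROM ranked
        WHERE rn = 1
        ORDER BY answers
    """
    return conn.execute(query).fetchall()


with_row_number = top_rows("ROW_NUMBER")  # exactly one row per group
with_rank = top_rows("RANK")              # every row tied for the maximum date

print(len(with_row_number), len(with_rank))
```

Which of the tied rows ROW_NUMBER keeps is not determined by this ORDER BY; if that matters, adding a tie-breaking column to the window's ORDER BY makes the choice deterministic.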
