I am trying to construct a query in PostgreSQL 9.0 that gets the longest sequence of continuous rows for a specific column.
Consider the following table:
lap_id (serial), lap_no (int), car_type (enum), race_id (int FK)
Where lap_no
is unique for each (race_id, car_type)
.
I would like the query to produce the longest sequence for a given race_id
and car_type
, so it would return an int
(or long) that is highest.
With the following data:
1, 1, red, 1
2, 2, red, 1
3, 3, red, 1
4, 4, red, 1
5, 1, blue, 1
6, 5, red, 1
7, 2, blue, 1
8, 1, green, 1
For car_type = red and race_id = 1
the query would return 5
as the longest sequence of the lap_no
field.
I found a similar question here however my situation is a bit more straightforward.
(I would also like to know the longest sequence for a given car_type
for all races, but was planning to work that out myself.)
2 Answers 2
Your description results in a table definition like this:
CREATE TABLE tbl (
lap_id serial PRIMARY KEY
, lap_no int NOT NULL
, car_type enum NOT NULL
, race_id int NOT NULL -- REFERENCES ...
, UNIQUE(race_id, car_type, lap_no)
);
General solution for this class of problems
To get the longest sequence (1 result, longest of all, arbitrary pick if there are ties):
SELECT race_id, car_type, count(*) AS seq_len
FROM (
SELECT *, count(*) FILTER (WHERE step)
OVER (ORDER BY race_id, car_type, lap_no) AS grp
FROM (
SELECT *, (lag(lap_no) OVER (PARTITION BY race_id, car_type ORDER BY lap_no) + 1)
IS DISTINCT FROM lap_no AS step
FROM tbl
) x
) y
GROUP BY race_id, car_type, grp
ORDER BY seq_len DESC
LIMIT 1;
count(*) FILTER (WHERE step)
only counts TRUE
(= step to next group), which results in a new number for every new group.
Related procedural solution with plpgsql:
If the top requirement is performance, the plpgsql function is typically faster in this particular case because it can calculate the result in a single scan.
Faster for consecutive numbers
We can capitalize on the fact that consecutive lap_no
define a sequence, for a much simpler and faster version:
SELECT race_id, car_type, count(*) AS seq_len
FROM (
SELECT race_id, car_type
, row_number() OVER (PARTITION BY race_id, car_type ORDER BY lap_no) - lap_no AS grp
FROM tbl
) x
GROUP BY race_id, car_type, grp
ORDER BY seq_len DESC
LIMIT 1;
Consecutive laps end up in the same grp
. Every missing lap results in a lower grp
per partition.
This relies on (race_id, car_type, lap_no)
being UNIQUE NOT NULL
. NULL values or duplicates could break the logic.
Discussion of Jack's simpler alternative
@Jack's version effectively counts all laps (rows) where the previous lap_no
in this race_id
had the same car_type
. That's simpler and faster and correct - as long as each car_type
can only have one sequence per race_id
.
But for a task that simple the query could be simpler, yet. It would follow logically that all lap_no
per (car_type, race_id)
must be in sequence, and we could just count the laps:
SELECT race_id, car_type, count(*) AS seq_len
FROM tbl
GROUP BY race_id, car_type
ORDER BY seq_len DESC
LIMIT 1;
If, on the other hand, one car_type
can have multiple separate sequences per race_id (and the question does not specify otherwise), Jack's version will fail.
Faster for a given race / car type
In reply to the comment / clarifications in the question: restricting the query to one given (race_id, car_type)
will make it much faster, of course:
SELECT count(*) AS seq_len
FROM (
SELECT row_number() OVER (ORDER BY lap_no) - lap_no AS grp
FROM tbl
WHERE race_id = 1
AND car_type = 'red'
) x
GROUP BY grp
ORDER BY seq_len DESC
LIMIT 1;
db<>fiddle here
Old SQL Fiddle
Index
Key to top performance is a fitting index (except for the mentioned procedural solution working with a single sequential scan). A multicolumn index like this serves best:
CREATE INDEX tbl_mult_idx ON tbl (race_id, car_type, lap_no);
If your table has the UNIQUE
constraint I assumed at the top, that is implemented with just this (unique) index internally, and you do not need to create another index.
-
Hi Erwin, thanks that does the job, however it takes ~17sec on my database! Dont suppose you could provide a modification so it takes race_id and car_type as parameters rather than comparing the entire table? (I have tried re-writing it and keep running into errors)DaveB– DaveB2013年02月25日 19:28:49 +00:00Commented Feb 25, 2013 at 19:28
create table tbl (lap_no int, car_type text, race_id int);
insert into tbl values (1,'red',1),(2,'red',1),(3,'red',1),(4,'red',1), (1,'blue',1),(5,'red',1),(2,'blue',1),(1,'green',1);
select car_type, race_id, sum(case when lap_no=(prev+1) then 1 else 0 end)+1 seq_len from ( select *, lag(lap_no) over (partition by car_type, race_id order by lap_no) prev from tbl ) z group by car_type, race_id order by seq_len desc limit 1;
/* |car_type|race_id|seq_len| |:-------|------:|------:| |red | 1| 5| */
-
or perhaps
sum((lap_no=(prev+1))::integer)+1
but I'm not sure that is easier to readJack Douglas– Jack Douglas2013年02月25日 20:08:53 +00:00Commented Feb 25, 2013 at 20:08
Explore related questions
See similar questions with these tags.