I have a situation that I think can be solved using a window function, but I'm not sure.
Imagine the following table:
CREATE TABLE tmp (
date timestamp
, id_type integer
) ;
INSERT INTO tmp (date, id_type)
VALUES
( '2017-01-10 07:19:21.0', 3 ),
( '2017-01-10 07:19:22.0', 3 ),
( '2017-01-10 07:19:23.1', 3 ),
( '2017-01-10 07:19:24.1', 3 ),
( '2017-01-10 07:19:25.0', 3 ),
( '2017-01-10 07:19:26.0', 5 ),
( '2017-01-10 07:19:27.1', 3 ),
( '2017-01-10 07:19:28.0', 5 ),
( '2017-01-10 07:19:29.0', 5 ),
( '2017-01-10 07:19:30.1', 3 ),
( '2017-01-10 07:19:31.0', 5 ),
( '2017-01-10 07:19:32.0', 3 ),
( '2017-01-10 07:19:33.1', 5 ),
( '2017-01-10 07:19:35.0', 5 ),
( '2017-01-10 07:19:36.1', 5 ),
( '2017-01-10 07:19:37.1', 5 );
I'd like to have a new group at each change of value in column id_type.
E.g. the 1st group runs from 7:19:21 to 7:19:25, the 2nd starts and finishes at 7:19:26, and so on.
At this moment, using the query below ...
SELECT distinct
min(min(date)) over w as begin,
max(max(date)) over w as end,
id_type
FROM tmp
GROUP BY id_type
WINDOW w AS (PARTITION BY id_type)
ORDER BY begin;
I get the following result:
begin end id_type
2017-01-10 07:19:21.0 2017-01-10 07:19:32.0 3
2017-01-10 07:19:26.0 2017-01-10 07:19:37.1 5
While I'd like:
begin end id_type
2017-01-10 07:19:21.0 2017-01-10 07:19:25.0 3
2017-01-10 07:19:26.0 2017-01-10 07:19:26.0 5
2017-01-10 07:19:27.1 2017-01-10 07:19:27.1 3
2017-01-10 07:19:28.0 2017-01-10 07:19:29.0 5
2017-01-10 07:19:30.1 2017-01-10 07:19:30.1 3
2017-01-10 07:19:31.0 2017-01-10 07:19:31.0 5
2017-01-10 07:19:32.0 2017-01-10 07:19:32.0 3
2017-01-10 07:19:33.1 2017-01-10 07:19:37.1 5
Once that works, I want to include more criteria to define groups, and these others will be nullable.
Postgres version: 8.4. We have Postgres with PostGIS, so it is not easy to upgrade (PostGIS functions change names, and there are other problems). But we are already rewriting everything, and the new version will use Postgres 9.x with PostGIS 2.x.
- General solution: dba.stackexchange.com/questions/35380/… – Erwin Brandstetter, Mar 6, 2017 at 23:00
7 Answers
A few points:
- Don't call a non-temporary table tmp; that just gets confusing.
- Don't use text for timestamps. (You're doing that in your example; we can tell because the timestamp wasn't truncated and still has .0.)
- Don't call a field that holds a time date. If it has date and time, it's a timestamp, so store it as one.

A sketch of a cleaner definition follows the list.
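As a sketch of those points (the table and column names here are illustrative, not from the question):

-- Hypothetical cleaner definition: a descriptive table name,
-- and a timestamp column named for what it holds
CREATE TABLE sensor_reading (
 taken_at timestamp NOT NULL,
 id_type integer NOT NULL
);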
Better to use a window function:
SELECT id_type, grp, min(date), max(date)
FROM (
SELECT date, id_type, count(is_reset) OVER (ORDER BY date) AS grp
FROM (
SELECT date, id_type, CASE WHEN lag(id_type) OVER (ORDER BY date) <> id_type THEN 1 END AS is_reset
FROM tmp
) AS t
) AS g
GROUP BY id_type, grp
ORDER BY min(date);
Outputs
id_type | grp | min | max
---------+-----+-----------------------+-----------------------
 3 | 0 | 2017-01-10 07:19:21.0 | 2017-01-10 07:19:25.0
 5 | 1 | 2017-01-10 07:19:26.0 | 2017-01-10 07:19:26.0
 3 | 2 | 2017-01-10 07:19:27.1 | 2017-01-10 07:19:27.1
 5 | 3 | 2017-01-10 07:19:28.0 | 2017-01-10 07:19:29.0
 3 | 4 | 2017-01-10 07:19:30.1 | 2017-01-10 07:19:30.1
 5 | 5 | 2017-01-10 07:19:31.0 | 2017-01-10 07:19:31.0
 3 | 6 | 2017-01-10 07:19:32.0 | 2017-01-10 07:19:32.0
 5 | 7 | 2017-01-10 07:19:33.1 | 2017-01-10 07:19:37.1
(8 rows)
Explanation
First we need resets. We generate them with lag():
SELECT date, id_type, CASE WHEN lag(id_type) OVER (ORDER BY date) <> id_type THEN 1 END AS is_reset
FROM tmp
ORDER BY date;
date | id_type | is_reset
-----------------------+---------+----------
 2017-01-10 07:19:21.0 | 3 |
 2017-01-10 07:19:22.0 | 3 |
 2017-01-10 07:19:23.1 | 3 |
 2017-01-10 07:19:24.1 | 3 |
 2017-01-10 07:19:25.0 | 3 |
 2017-01-10 07:19:26.0 | 5 | 1
 2017-01-10 07:19:27.1 | 3 | 1
 2017-01-10 07:19:28.0 | 5 | 1
 2017-01-10 07:19:29.0 | 5 |
 2017-01-10 07:19:30.1 | 3 | 1
 2017-01-10 07:19:31.0 | 5 | 1
 2017-01-10 07:19:32.0 | 3 | 1
 2017-01-10 07:19:33.1 | 5 | 1
 2017-01-10 07:19:35.0 | 5 |
 2017-01-10 07:19:36.1 | 5 |
 2017-01-10 07:19:37.1 | 5 |
(16 rows)
Then we count to get groups.
SELECT date, id_type, count(is_reset) OVER (ORDER BY date) AS grp
FROM (
SELECT date, id_type, CASE WHEN lag(id_type) OVER (ORDER BY date) <> id_type THEN 1 END AS is_reset
FROM tmp
ORDER BY date
) AS t
ORDER BY date;
date | id_type | grp
-----------------------+---------+-----
 2017-01-10 07:19:21.0 | 3 | 0
 2017-01-10 07:19:22.0 | 3 | 0
 2017-01-10 07:19:23.1 | 3 | 0
 2017-01-10 07:19:24.1 | 3 | 0
 2017-01-10 07:19:25.0 | 3 | 0
 2017-01-10 07:19:26.0 | 5 | 1
 2017-01-10 07:19:27.1 | 3 | 2
 2017-01-10 07:19:28.0 | 5 | 3
 2017-01-10 07:19:29.0 | 5 | 3
 2017-01-10 07:19:30.1 | 3 | 4
 2017-01-10 07:19:31.0 | 5 | 5
 2017-01-10 07:19:32.0 | 3 | 6
 2017-01-10 07:19:33.1 | 5 | 7
 2017-01-10 07:19:35.0 | 5 | 7
 2017-01-10 07:19:36.1 | 5 | 7
 2017-01-10 07:19:37.1 | 5 | 7
(16 rows)
Then we wrap it in a subselect with GROUP BY and ORDER BY, and select the min and max (the range):
SELECT id_type, grp, min(date), max(date)
FROM (
.. stuff
) AS g
GROUP BY id_type, grp
ORDER BY min(date);
1. Window functions plus subqueries
Count the steps to form groups, similar to Evan's idea, with modifications and fixes:
SELECT id_type
, min(date) AS begin
, max(date) AS end
, count(*) AS row_ct -- optional addition
FROM (
SELECT date, id_type, count(step OR NULL) OVER (ORDER BY date) AS grp
FROM (
SELECT date, id_type
, lag(id_type, 1, id_type) OVER (ORDER BY date) <> id_type AS step
FROM tmp
) sub1
) sub2
GROUP BY id_type, grp
ORDER BY min(date);
This assumes the involved columns are NOT NULL. Else you need to do more; a NULL-safe sketch follows below.

Also assuming date to be defined UNIQUE, else you need to add a tiebreaker to the ORDER BY clauses to get deterministic results. Like: ORDER BY date, id.
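A minimal sketch of both fixes combined, assuming a hypothetical unique id column as tiebreaker; IS DISTINCT FROM is NULL-safe, so it treats NULL like an ordinary value:

SELECT id_type, min(date) AS begin, max(date) AS end
FROM (
 SELECT date, id_type, count(step OR NULL) OVER (ORDER BY date, id) AS grp
 FROM (
 SELECT date, id, id_type
 -- NULL -> 3 counts as a step; NULL -> NULL does not
 , lag(id_type) OVER (ORDER BY date, id) IS DISTINCT FROM id_type AS step
 FROM tmp
 ) sub1
 ) sub2
GROUP BY id_type, grp
ORDER BY min(date);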
Detailed explanation (answer to very similar question):
Note in particular:
In related cases, lag() with 3 parameters can be essential to cover the corner case of the first (or last) row elegantly. (The 3rd parameter is used as the default if there is no previous (next) row.)

lag(id_type, 1, id_type) OVER ()

Since we are only interested in an actual change of id_type (TRUE), it does not matter in this particular case. NULL and FALSE both don't count as a step.

count(step OR NULL) OVER (ORDER BY date) is the shortest syntax that also works in Postgres 9.3 or older. count() only counts non-null values ...

In modern Postgres, the cleaner, equivalent syntax would be:

count(step) FILTER (WHERE step) OVER (ORDER BY date)
Details:
2. Subtract two window functions, one subquery
Similar to Erik's idea with modifications:
SELECT min(date) AS begin
, max(date) AS end
, id_type
FROM (
SELECT date, id_type
, row_number() OVER (ORDER BY date)
- row_number() OVER (PARTITION BY id_type ORDER BY date) AS grp
FROM tmp
) sub
GROUP BY id_type, grp
ORDER BY min(date);
If date is defined UNIQUE, like I mention above, dense_rank() would be pointless, since the result is the same as for row_number() and the latter is substantially cheaper.

If date is not defined UNIQUE (and we don't know that the only duplicates are on (date, id_type)), all of these queries are pointless, since the result is arbitrary.
Also, a subquery is typically cheaper than a CTE in Postgres. Only use CTEs when you need them.
Related answers with more explanation:
- GROUP BY and aggregate sequential numeric values
- Group by repeating attribute
- GROUP BY uninterrupted sequence of logs for same location
In related cases where we already have a running number in the table, we can make do with a single window function:
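For instance, with a hypothetical gapless running number rn already stored in the table, a single window function per row is enough (a sketch, not from the original answers):

SELECT id_type, min(date) AS begin, max(date) AS end
FROM (
 SELECT date, id_type
 -- rn is assumed to be a stored, gapless running number
 , rn - row_number() OVER (PARTITION BY id_type ORDER BY rn) AS grp
 FROM tmp
 ) sub
GROUP BY id_type, grp
ORDER BY min(date);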
3. Top performance with plpgsql function
Since this question has become unexpectedly popular, I'll add another solution to demonstrate top performance.
SQL has many sophisticated tools to create solutions with short and elegant syntax. But a declarative language has its limits for more complex requirements that involve procedural elements.
A procedural solution with a server-side function is faster for this than anything posted so far because it only needs a single sequential scan over the table and a single sort operation. If a fitting index is available, even just a single index-only scan.
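A fitting index for this might look like the following (a sketch; whether you actually get index-only scans also depends on the state of the visibility map):

CREATE INDEX tmp_date_id_type_idx ON tmp (date, id_type);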
CREATE OR REPLACE FUNCTION f_tmp_groups()
RETURNS TABLE (id_type int, grp_begin timestamp, grp_end timestamp)
LANGUAGE plpgsql AS
$func$
DECLARE
_row tmp; -- use table type for row variable
BEGIN
FOR _row IN
TABLE tmp ORDER BY date -- add more columns for deterministic order
LOOP
CASE _row.id_type = id_type
WHEN TRUE THEN -- same group continues
grp_end := _row.date; -- remember last date so far
WHEN FALSE THEN -- next group starts
RETURN NEXT; -- return result for last group
id_type := _row.id_type;
grp_begin := _row.date;
grp_end := _row.date;
ELSE -- NULL for 1st row
id_type := _row.id_type; -- remember row data for starters
grp_begin := _row.date;
grp_end := _row.date;
END CASE;
END LOOP;
RETURN NEXT; -- return last result row
END
$func$;
Call:
SELECT * FROM f_tmp_groups();
Test with:
EXPLAIN (ANALYZE, TIMING OFF) -- to focus on total performance
SELECT * FROM f_tmp_groups();
You could make the function generic with polymorphic types and pass table type and column names. Details:
If you don't want to or cannot persist a function for this, it would even pay to create a temporary function on the fly. Costs a few ms.
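Temporary functions live in the pg_temp schema and vanish at the end of the session. A minimal sketch (note that temporary functions must be schema-qualified when called):

CREATE FUNCTION pg_temp.f_demo()
 RETURNS int
 LANGUAGE sql AS 'SELECT 1';

SELECT pg_temp.f_demo(); -- the pg_temp prefix is required in the call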
db<>fiddle here - comparing performance for all three. (Building on Jack's test case, modified.)
dbfiddle for Postgres 8.4, where performance differences are even bigger. (Not operational any more.)
- Read this a few times, and I'm still unsure of what you're talking about with the three-argument lag, or when you'd have to use count(x or null), or even what it's doing there. Perhaps you could show some samples where it is required, because it's not required here, and what would key the requirement to cover those corner cases. BTW, I changed my downvote to an upvote just for the pl/pgsql example. That's really cool. (But generally I'm against answers that summarize other answers or cover corner cases, though I hate to say that this is a corner case, because I don't understand it.) – Evan Carroll, Mar 14, 2017 at 5:39
- I would put them in two separate self-answered questions, because I'm sure I'm not the only one wondering what count(x or null) does. I'll be happy to ask both questions if you would prefer. – Evan Carroll, Mar 14, 2017 at 5:44
- Here is one question: In what case is count(x or null) needed in Gaps and Islands? – Evan Carroll, Mar 14, 2017 at 5:57
You can do this as a simple subtraction of ROW_NUMBER() operations (or, if your dates are not unique, though still unique per id_type, you can use DENSE_RANK() instead, though it will be a more expensive query):
WITH IdTypes AS (
SELECT
date,
id_type,
Row_Number() OVER (ORDER BY date)
- Row_Number() OVER (PARTITION BY id_type ORDER BY date)
AS Seq
FROM
tmp
)
SELECT
Min(date) AS begin,
Max(date) AS end,
id_type
FROM IdTypes
GROUP BY id_type, Seq
ORDER BY begin
;
See this work at DB Fiddle (or see the DENSE_RANK version)
Result:
begin end id_type
--------------------- --------------------- -------
2017-01-10 07:19:21 2017-01-10 07:19:25 3
2017-01-10 07:19:26 2017-01-10 07:19:26 5
2017-01-10 07:19:27.1 2017-01-10 07:19:27.1 3
2017-01-10 07:19:28 2017-01-10 07:19:29 5
2017-01-10 07:19:30.1 2017-01-10 07:19:30.1 3
2017-01-10 07:19:31 2017-01-10 07:19:31 5
2017-01-10 07:19:32 2017-01-10 07:19:32 3
2017-01-10 07:19:33.1 2017-01-10 07:19:37.1 5
Logically, you can think of this as a simple DENSE_RANK() with a PREORDER BY: you want the DENSE_RANK of all the items that are ranked together, ordered by the dates. You just have to deal with the pesky problem that at each change in the date, DENSE_RANK will increment. You do that by using the expression I showed you above. Imagine if you had this syntax: DENSE_RANK() OVER (PREORDER BY date, ORDER BY id_type), where the PREORDER is excluded from the ranking calculation and only the ORDER BY is counted.
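To see why the subtraction works, it can help to look at both row numbers side by side (a sketch run against the sample table):

SELECT date, id_type,
 Row_Number() OVER (ORDER BY date) AS rn_overall,
 Row_Number() OVER (PARTITION BY id_type ORDER BY date) AS rn_per_type,
 -- the difference stays constant within an island
 Row_Number() OVER (ORDER BY date)
 - Row_Number() OVER (PARTITION BY id_type ORDER BY date) AS Seq
FROM tmp
ORDER BY date;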
Note that it's important to GROUP BY both the generated Seq column and the id_type column. Seq is NOT unique by itself; there can be overlaps between islands, so you must also group by id_type.
For further reading on this topic:
- Detect changes between row values — read the See It For Yourself section.
- Or this simpler explanation
That first link gives you some code you can use if you wanted the begin or end date to be the same as the previous or next period's end/begin date (so there are no gaps). Plus other versions that could assist you in your query. Though they have to be translated from SQL Server syntax...
On Postgres 8.4 you can use a recursive CTE.
How it works
The recursive query adds a level (lv) to each different id_type, by selecting dates one by one in descending order.
date | id_type | lv
-----------------------+---------+----
2017-01-10 07:19:21.0 | 3 | 8
2017-01-10 07:19:22.0 | 3 | 8
2017-01-10 07:19:23.1 | 3 | 8
2017-01-10 07:19:24.1 | 3 | 8
2017-01-10 07:19:25.0 | 3 | 8
2017-01-10 07:19:26.0 | 5 | 7
2017-01-10 07:19:27.1 | 3 | 6
2017-01-10 07:19:28.0 | 5 | 5
2017-01-10 07:19:29.0 | 5 | 5
2017-01-10 07:19:30.1 | 3 | 4
2017-01-10 07:19:31.0 | 5 | 3
2017-01-10 07:19:32.0 | 3 | 2
2017-01-10 07:19:33.1 | 5 | 1
2017-01-10 07:19:35.0 | 5 | 1
2017-01-10 07:19:36.1 | 5 | 1
2017-01-10 07:19:37.1 | 5 | 1
Then use MAX(date) and MIN(date), grouping by lv and id_type, to get the desired result:
with RECURSIVE rdates as
(
(select date, id_type, 1 lv
from yourTable
order by date desc
limit 1
)
union
(select d.date, d.id_type,
case when r.id_type = d.id_type
then r.lv
else r.lv + 1
end lv
from yourTable d
inner join rdates r
on d.date < r.date
order by date desc
limit 1)
)
select min(date) StartDate,
max(date) EndDate,
id_type
from rdates
group by lv, id_type
;
+---------------------+---------------------+---------+
| startdate | enddate | id_type |
+---------------------+---------------------+---------+
| 10.01.2017 07:19:21 | 10.01.2017 07:19:25 | 3 |
| 10.01.2017 07:19:26 | 10.01.2017 07:19:26 | 5 |
| 10.01.2017 07:19:27 | 10.01.2017 07:19:27 | 3 |
| 10.01.2017 07:19:28 | 10.01.2017 07:19:29 | 5 |
| 10.01.2017 07:19:30 | 10.01.2017 07:19:30 | 3 |
| 10.01.2017 07:19:31 | 10.01.2017 07:19:31 | 5 |
| 10.01.2017 07:19:32 | 10.01.2017 07:19:32 | 3 |
| 10.01.2017 07:19:33 | 10.01.2017 07:19:37 | 5 |
+---------------------+---------------------+---------+
Check it: http://rextester.com/WCOYFP6623
Here is another method, which is similar to Evan's and Erwin's in that it uses LAG to determine islands. It differs from those solutions in that it uses only one level of nesting, no grouping, and considerably more window functions:
SELECT
id_type,
date AS begin,
COALESCE(
LEAD(prev_date) OVER (ORDER BY date ASC),
last_date
) AS end
FROM
(
SELECT
id_type,
date,
LAG(date) OVER (ORDER BY date ASC) AS prev_date,
MAX(date) OVER () AS last_date,
CASE id_type
WHEN LAG(id_type) OVER (ORDER BY date ASC)
THEN 0
ELSE 1
END AS is_start
FROM
tmp
) AS derived
WHERE
is_start = 1
ORDER BY
date ASC
;
The is_start computed column in the nested SELECT marks the beginning of each island. Additionally, the nested SELECT exposes each row's previous date and the dataset's last date.

For rows that are the beginnings of their respective islands, the previous date effectively is the previous island's ending date, and that is how the main SELECT uses it. It picks only the rows matching the is_start = 1 condition, and for each returned row it shows the row's own date as begin and the following row's prev_date as end. As the last row does not have a following row, LEAD(prev_date) returns a null for it, for which the COALESCE function substitutes the dataset's last date.
You can play with this solution at dbfiddle.
When introducing additional columns identifying the islands, you will probably want to introduce a PARTITION BY subclause in each window function's OVER clause. For instance, if you want to detect the islands within groups defined by a parent_id, the above query will probably need to look like this:
SELECT
parent_id,
id_type,
date AS begin,
COALESCE(
LEAD(prev_date) OVER (PARTITION BY parent_id ORDER BY date ASC),
last_date
) AS end
FROM
(
SELECT
parent_id,
id_type,
date,
LAG(date) OVER (PARTITION BY parent_id ORDER BY date ASC) AS prev_date,
MAX(date) OVER (PARTITION BY parent_id) AS last_date,
CASE id_type
WHEN LAG(id_type) OVER (PARTITION BY parent_id ORDER BY date ASC)
THEN 0
ELSE 1
END AS is_start
FROM
tmp
) AS derived
WHERE
is_start = 1
ORDER BY
date ASC
;
And if you decide to go with either Erwin's or Evan's solution, I believe a similar change will need to be added to it as well.
More out of academic interest than as a practical solution, you can also achieve this with a user-defined aggregate. Like the other solutions, this will work even on Postgres 8.4, but as others have commented, please upgrade if you can.
The aggregate handles null as if it were a different foo_type, so runs of nulls are given the same grp; that may or may not be what you want (a possible variant is sketched after the fiddle link below).
create function grp_sfunc(integer[], integer) returns integer[]
language sql as $$
 select array[1ドル[1] + (1ドル[2] is distinct from 2ドル or 1ドル[3] = 0)::integer, 2,ドル 1];
$$;

create function grp_finalfunc(integer[]) returns integer
language sql as $$
 select 1ドル[1];
$$;

create aggregate grp(integer)
( sfunc = grp_sfunc
, stype = integer[]
, finalfunc = grp_finalfunc
, initcond = '{0,0,0}'
);

select min(foo_at) begin_at, max(foo_at) end_at, foo_type
from (select *, grp(foo_type) over (order by foo_at) from foo) z
group by grp, foo_type
order by 1;
begin_at | end_at | foo_type
:-------------------- | :-------------------- | -------:
2017-01-10 07:19:21 | 2017-01-10 07:19:25 | 3
2017-01-10 07:19:26 | 2017-01-10 07:19:26 | 5
2017-01-10 07:19:27.1 | 2017-01-10 07:19:27.1 | 3
2017-01-10 07:19:28 | 2017-01-10 07:19:29 | 5
2017-01-10 07:19:30.1 | 2017-01-10 07:19:30.1 | 3
2017-01-10 07:19:31 | 2017-01-10 07:19:31 | 5
2017-01-10 07:19:32 | 2017-01-10 07:19:32 | 3
2017-01-10 07:19:33.1 | 2017-01-10 07:19:37.1 | 5
dbfiddle here
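If you instead wanted each null to start its own group, a possible tweak (an untested sketch; the rest of the aggregate stays the same) is to force an increment whenever the current value is null:

create or replace function grp_sfunc(integer[], integer) returns integer[]
language sql as $$
 -- "or 2ドル is null" makes every null row open a new group
 select array[1ドル[1] + (1ドル[2] is distinct from 2ドル or 2ドル is null or 1ドル[3] = 0)::integer, 2,ドル 1];
$$;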
This can be done with a RECURSIVE CTE to pass the "begin time" from one row to the next, plus some extra (convenience) preparations.
This query returns the result you wish:
WITH RECURSIVE q AS
(
SELECT
id_type,
"date",
/* We compute next id_type for convenience, plus row_number */
row_number() OVER (w) AS rn,
lead(id_type) OVER (w) AS next_id_type
FROM
t
WINDOW
w AS (ORDER BY "date")
)
After the preparation comes the recursive part:
, rec AS
(
/* Anchor */
SELECT
q.rn,
q."date" AS "begin",
/* Look also at the **next** row to find out whether we need to mark an end */
case when q.id_type is distinct from q.next_id_type then q."date" END AS "end",
q.id_type
FROM
q
WHERE
rn = 1
UNION ALL
/* Loop */
SELECT
q.rn,
/* We keep copying 'begin' from one row to the next while type doesn't change */
case when q.id_type = rec.id_type then rec.begin else q."date" end AS "begin",
case when q.id_type is distinct from q.next_id_type then q."date" end AS "end",
q.id_type
FROM
rec
JOIN q ON q.rn = rec.rn+1
)
-- We filter the rows where "end" is not null, and project only needed columns
SELECT
"begin", "end", id_type
FROM
rec
WHERE
"end" is not null ;
You can check this at http://rextester.com/POYM83542
This method doesn't scale well. For an 8,641-row table it takes 7 seconds; for a table twice that size, it takes 28 seconds. A few more samples show execution times that look like O(n^2).

Evan Carroll's method takes less than 1 second (i.e.: go for it!) and looks like O(n). Recursive queries of this kind are absolutely inefficient and should be considered a last resort.