
I have a situation I think can be solved using window functions, but I'm not sure.

Imagine the following table

CREATE TABLE tmp (
 date timestamp
, id_type integer
) ;
INSERT INTO tmp (date, id_type)
VALUES
 ( '2017-01-10 07:19:21.0', 3 ),
 ( '2017-01-10 07:19:22.0', 3 ),
 ( '2017-01-10 07:19:23.1', 3 ),
 ( '2017-01-10 07:19:24.1', 3 ),
 ( '2017-01-10 07:19:25.0', 3 ),
 ( '2017-01-10 07:19:26.0', 5 ),
 ( '2017-01-10 07:19:27.1', 3 ),
 ( '2017-01-10 07:19:28.0', 5 ),
 ( '2017-01-10 07:19:29.0', 5 ),
 ( '2017-01-10 07:19:30.1', 3 ),
 ( '2017-01-10 07:19:31.0', 5 ),
 ( '2017-01-10 07:19:32.0', 3 ),
 ( '2017-01-10 07:19:33.1', 5 ),
 ( '2017-01-10 07:19:35.0', 5 ),
 ( '2017-01-10 07:19:36.1', 5 ),
 ( '2017-01-10 07:19:37.1', 5 );

I'd like to have a new group at each change of value in column id_type. E.g. the 1st group from 7:19:21 to 7:19:25, the 2nd starting and finishing at 7:19:26, and so on.

At this moment, using the query below ...

SELECT distinct 
 min(min(date)) over w as begin, 
 max(max(date)) over w as end, 
 id_type
FROM tmp
GROUP BY id_type
WINDOW w AS (PARTITION BY id_type)
ORDER BY begin;

I get the following result:

begin end id_type
2017-01-10 07:19:21.0 2017-01-10 07:19:32.0 3
2017-01-10 07:19:26.0 2017-01-10 07:19:37.1 5

While I'd like:

begin end id_type
2017-01-10 07:19:21.0 2017-01-10 07:19:25.0 3
2017-01-10 07:19:26.0 2017-01-10 07:19:26.0 5
2017-01-10 07:19:27.1 2017-01-10 07:19:27.1 3
2017-01-10 07:19:28.0 2017-01-10 07:19:29.0 5
2017-01-10 07:19:30.1 2017-01-10 07:19:30.1 3
2017-01-10 07:19:31.0 2017-01-10 07:19:31.0 5
2017-01-10 07:19:32.0 2017-01-10 07:19:32.0 3
2017-01-10 07:19:33.1 2017-01-10 07:19:37.1 5

Once that works, I want to include more criteria to define groups, and these others will be nullable.

Postgres version: 8.4. We have Postgres with PostGIS, so it is not easy to upgrade (some PostGIS functions change names, and there are other problems), but we are already rewriting everything, and the new system will use Postgres 9.x with PostGIS 2.x.

asked Mar 6, 2017 at 20:40

7 Answers


For a few points,

  • Don't call a non-temporary table tmp; that just gets confusing.
  • Don't store timestamps as text (your example suggests you are: we can tell because the value didn't get truncated and kept the .0).
  • Don't call a column that holds a time date. If it has a date and a time, it's a timestamp (so store it as one).

It's better to use a window function:

SELECT id_type, grp, min(date), max(date)
FROM (
  SELECT date, id_type, count(is_reset) OVER (ORDER BY date) AS grp
  FROM (
    SELECT date, id_type,
           CASE WHEN lag(id_type) OVER (ORDER BY date) <> id_type THEN 1 END AS is_reset
    FROM tmp
  ) AS t
) AS g
GROUP BY id_type, grp
ORDER BY min(date);

Outputs

 id_type | grp |          min          |          max          
---------+-----+-----------------------+-----------------------
       3 |   0 | 2017-01-10 07:19:21.0 | 2017-01-10 07:19:25.0
       5 |   1 | 2017-01-10 07:19:26.0 | 2017-01-10 07:19:26.0
       3 |   2 | 2017-01-10 07:19:27.1 | 2017-01-10 07:19:27.1
       5 |   3 | 2017-01-10 07:19:28.0 | 2017-01-10 07:19:29.0
       3 |   4 | 2017-01-10 07:19:30.1 | 2017-01-10 07:19:30.1
       5 |   5 | 2017-01-10 07:19:31.0 | 2017-01-10 07:19:31.0
       3 |   6 | 2017-01-10 07:19:32.0 | 2017-01-10 07:19:32.0
       5 |   7 | 2017-01-10 07:19:33.1 | 2017-01-10 07:19:37.1
(8 rows)

Explanation

First we need the resets. We generate them with lag():

SELECT date, id_type, CASE WHEN lag(id_type) OVER (ORDER BY date) <> id_type THEN 1 END AS is_reset
FROM tmp
ORDER BY date;
          date         | id_type | is_reset 
-----------------------+---------+----------
 2017-01-10 07:19:21.0 |       3 | 
 2017-01-10 07:19:22.0 |       3 | 
 2017-01-10 07:19:23.1 |       3 | 
 2017-01-10 07:19:24.1 |       3 | 
 2017-01-10 07:19:25.0 |       3 | 
 2017-01-10 07:19:26.0 |       5 |        1
 2017-01-10 07:19:27.1 |       3 |        1
 2017-01-10 07:19:28.0 |       5 |        1
 2017-01-10 07:19:29.0 |       5 | 
 2017-01-10 07:19:30.1 |       3 |        1
 2017-01-10 07:19:31.0 |       5 |        1
 2017-01-10 07:19:32.0 |       3 |        1
 2017-01-10 07:19:33.1 |       5 |        1
 2017-01-10 07:19:35.0 |       5 | 
 2017-01-10 07:19:36.1 |       5 | 
 2017-01-10 07:19:37.1 |       5 | 
(16 rows)

Then we count to get groups.

SELECT date, id_type, count(is_reset) OVER (ORDER BY date) AS grp
FROM (
 SELECT date, id_type, CASE WHEN lag(id_type) OVER (ORDER BY date) <> id_type THEN 1 END AS is_reset
 FROM tmp
 ORDER BY date
) AS t
ORDER BY date;
          date         | id_type | grp 
-----------------------+---------+-----
 2017-01-10 07:19:21.0 |       3 |   0
 2017-01-10 07:19:22.0 |       3 |   0
 2017-01-10 07:19:23.1 |       3 |   0
 2017-01-10 07:19:24.1 |       3 |   0
 2017-01-10 07:19:25.0 |       3 |   0
 2017-01-10 07:19:26.0 |       5 |   1
 2017-01-10 07:19:27.1 |       3 |   2
 2017-01-10 07:19:28.0 |       5 |   3
 2017-01-10 07:19:29.0 |       5 |   3
 2017-01-10 07:19:30.1 |       3 |   4
 2017-01-10 07:19:31.0 |       5 |   5
 2017-01-10 07:19:32.0 |       3 |   6
 2017-01-10 07:19:33.1 |       5 |   7
 2017-01-10 07:19:35.0 |       5 |   7
 2017-01-10 07:19:36.1 |       5 |   7
 2017-01-10 07:19:37.1 |       5 |   7
(16 rows)

Then we wrap that in a subselect with GROUP BY and ORDER BY, and select the min and max (the range):

SELECT id_type, grp, min(date), max(date)
FROM (
 .. stuff
) AS g
GROUP BY id_type, grp
ORDER BY min(date);
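The same three-step logic can be sketched outside the database. Here is a Python model (an illustration only, not part of the SQL solution), using the sample id_type values in date order, with row indices standing in for timestamps:

```python
# Python sketch of the lag()/count() logic: lag() marks resets, a running
# count of the non-NULL resets forms the group number, then we aggregate.
ids = [3, 3, 3, 3, 3, 5, 3, 5, 5, 3, 5, 3, 5, 5, 5, 5]

# Step 1: is_reset is 1 where id_type differs from the previous row, else NULL
is_reset = [None] + [1 if cur != prev else None for prev, cur in zip(ids, ids[1:])]

# Step 2: count(is_reset) OVER (ORDER BY date) -- count() skips NULLs
grp, running = 0, []
for r in is_reset:
    if r is not None:
        grp += 1
    running.append(grp)

# Step 3: GROUP BY id_type, grp -> min/max row index per group
groups = {}
for i, key in enumerate(zip(ids, running)):
    lo, hi = groups.get(key, (i, i))
    groups[key] = (min(lo, i), max(hi, i))
```

Running this yields the same 8 groups as the query above, with grp values 0 through 7.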
answered Mar 6, 2017 at 22:58

1. Window functions plus subqueries

Count the steps to form groups, similar to Evan's idea, with modifications and fixes:

SELECT id_type
     , min(date) AS begin
     , max(date) AS end
     , count(*)  AS row_ct          -- optional addition
FROM  (
   SELECT date, id_type, count(step OR NULL) OVER (ORDER BY date) AS grp
   FROM  (
      SELECT date, id_type
           , lag(id_type, 1, id_type) OVER (ORDER BY date) <> id_type AS step
      FROM   tmp
      ) sub1
   ) sub2
GROUP BY id_type, grp
ORDER BY min(date);

This assumes the involved columns are NOT NULL. Otherwise you need to do more.

Also assuming date to be defined UNIQUE, else you need to add a tiebreaker to the ORDER BY clauses to get deterministic results. Like: ORDER BY date, id.

Detailed explanation (answer to very similar question):

Note in particular:

  • In related cases, lag() with 3 parameters can be essential to cover the corner case of the first (or last) row elegantly. (The 3rd parameter is used as the default if there is no previous (next) row.)

     lag(id_type, 1, id_type) OVER ()
    

Since we are only interested in an actual change of id_type (TRUE), it does not matter in this particular case. NULL and FALSE both don't count as step.

  • count(step OR NULL) OVER (ORDER BY date) is the shortest syntax that also works in Postgres 9.3 or older. count() only counts non-null values ...

In modern Postgres, the cleaner, equivalent syntax would be:

 count(step) FILTER (WHERE step) OVER (ORDER BY date)

Details:
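As a quick illustration of the count(step OR NULL) trick, here is a small Python model (an assumption-level sketch, nothing beyond SQL's three-valued logic): FALSE OR NULL is NULL, TRUE OR NULL is TRUE, and count() ignores NULL, so only TRUE steps are counted:

```python
# Model of SQL's (step OR NULL): TRUE stays TRUE, FALSE and NULL become NULL.
def step_or_null(step):
    return True if step is True else None

# lag() yields NULL on the first row, hence the leading None
steps = [None, False, True, False, True]
counted = sum(1 for s in steps if step_or_null(s) is not None)  # like count(step OR NULL)
```

Here counted is 2: exactly the two TRUE steps, matching count(step) FILTER (WHERE step).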

2. Subtract two window functions, one subquery

Similar to Erik's idea with modifications:

SELECT min(date) AS begin
     , max(date) AS end
     , id_type
FROM  (
   SELECT date, id_type
        , row_number() OVER (ORDER BY date)
        - row_number() OVER (PARTITION BY id_type ORDER BY date) AS grp
   FROM   tmp
   ) sub
GROUP BY id_type, grp
ORDER BY min(date);

If date is defined UNIQUE, like I mention above, dense_rank() would be pointless, since the result is the same as for row_number() and the latter is substantially cheaper.

If date is not defined UNIQUE (and we don't know that the only duplicates are on (date, id_type)), all of these queries are pointless, since the result is arbitrary.

Also, a subquery is typically cheaper than a CTE in Postgres. Only use CTEs when you need them.
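To see why the subtraction works, here is a Python model (an illustration, not part of the answer) of the two row numbers on the sample id_type sequence; their difference is constant within each island:

```python
# row_number() OVER (ORDER BY date) minus
# row_number() OVER (PARTITION BY id_type ORDER BY date)
ids = [3, 3, 3, 3, 3, 5, 3, 5, 5, 3, 5, 3, 5, 5, 5, 5]

per_type = {}   # per-id_type counters (the partitioned row_number)
grp_keys = []
for rn, idt in enumerate(ids, start=1):          # rn = overall row_number()
    per_type[idt] = per_type.get(idt, 0) + 1
    grp_keys.append((idt, rn - per_type[idt]))   # grouping key (id_type, grp)

islands = len(set(grp_keys))
```

The 16 rows collapse into 8 distinct (id_type, grp) keys, one per island.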

Related answers with more explanation:

In related cases where we already have a running number in the table, we can make do with a single window function:

3. Top performance with plpgsql function

Since this question has become unexpectedly popular, I'll add another solution to demonstrate top performance.

SQL has many sophisticated tools to create solutions with short and elegant syntax. But a declarative language has its limits for more complex requirements that involve procedural elements.

A procedural solution with a server-side function is faster for this than anything posted so far because it only needs a single sequential scan over the table and a single sort operation. If a fitting index is available, even just a single index-only scan.

CREATE OR REPLACE FUNCTION f_tmp_groups()
  RETURNS TABLE (id_type int, grp_begin timestamp, grp_end timestamp)
  LANGUAGE plpgsql AS
$func$
DECLARE
   _row tmp;                          -- use table type for row variable
BEGIN
   FOR _row IN
      TABLE tmp ORDER BY date         -- add more columns for deterministic order
   LOOP
      CASE _row.id_type = id_type
      WHEN TRUE THEN                  -- same group continues
         grp_end := _row.date;        -- remember last date so far
      WHEN FALSE THEN                 -- next group starts
         RETURN NEXT;                 -- return result for last group
         id_type   := _row.id_type;
         grp_begin := _row.date;
         grp_end   := _row.date;
      ELSE                            -- NULL for 1st row
         id_type   := _row.id_type;   -- remember row data for starters
         grp_begin := _row.date;
         grp_end   := _row.date;
      END CASE;
   END LOOP;
   RETURN NEXT;                       -- return last result row
END
$func$;

Call:

SELECT * FROM f_tmp_groups();

Test with:

EXPLAIN (ANALYZE, TIMING OFF) -- to focus on total performance
SELECT * FROM f_tmp_groups();

You could make the function generic with polymorphic types and pass table type and column names. Details:

If you don't want to or cannot persist a function for this, it would even pay to create a temporary function on the fly. Costs a few ms.
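The function's control flow can be sketched in Python (a model of the single-pass algorithm, not of plpgsql itself): one pass in date order, extending the current group while id_type repeats and emitting it on change:

```python
def islands(rows):
    """rows: (ts, id_type) pairs sorted by ts; yields (id_type, begin, end)."""
    cur = None                      # current group: [id_type, grp_begin, grp_end]
    for ts, idt in rows:
        if cur is not None and cur[0] == idt:
            cur[2] = ts             # same group continues: remember last ts
        else:
            if cur is not None:
                yield tuple(cur)    # next group starts: emit the previous one
            cur = [idt, ts, ts]
    if cur is not None:
        yield tuple(cur)            # emit the final group

ids = [3, 3, 3, 3, 3, 5, 3, 5, 5, 3, 5, 3, 5, 5, 5, 5]
result = list(islands(enumerate(ids)))   # timestamps are just row indices here
```

Like the plpgsql version, this touches each row exactly once after a single sort.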


db<>fiddle here - comparing performance for all three. (Building on Jack's test case, modified.)

dbfiddle for Postgres 8.4, where performance differences are even bigger. (Not operational any more.)

answered Mar 6, 2017 at 23:13
  • Read this a few times - still unsure of what you're talking about with the three argument lag or when you'd have to use count(x or null) or even what it's doing there. Perhaps you could show some samples where it is required, because it's not required here. And, what would key the requirement to cover those corner cases. BTW, I changed my downvote to the upvote just for the pl/pgsql example. That's really cool. (But, generally I'm against answers that summarize other answers or cover corner cases -- though I hate to say that this is a corner case because I don't understand it). Commented Mar 14, 2017 at 5:39
  • I would put them in two separate self-answered questions because I'm sure I'm not the only one wondering what count(x or null) does. I'll be happy to ask both questions if you would prefer. Commented Mar 14, 2017 at 5:44
  • Here is one question In what case is a count(x or null) needed in the Gaps and Islands? Commented Mar 14, 2017 at 5:57

You can do this as a simple subtraction of ROW_NUMBER() operations (or, if your dates are not unique though still unique per id_type, you can use DENSE_RANK() instead, at the cost of a more expensive query):

WITH IdTypes AS (
 SELECT
 date,
 id_type,
 Row_Number() OVER (ORDER BY date)
 - Row_Number() OVER (PARTITION BY id_type ORDER BY date)
 AS Seq
 FROM
 tmp
)
SELECT
 Min(date) AS begin,
 Max(date) AS end,
 id_type
FROM IdTypes
GROUP BY id_type, Seq
ORDER BY begin
;

See this work at DB Fiddle (or see the DENSE_RANK version)

Result:

begin end id_type
--------------------- --------------------- -------
2017-01-10 07:19:21   2017-01-10 07:19:25   3
2017-01-10 07:19:26   2017-01-10 07:19:26   5
2017-01-10 07:19:27.1 2017-01-10 07:19:27.1 3
2017-01-10 07:19:28   2017-01-10 07:19:29   5
2017-01-10 07:19:30.1 2017-01-10 07:19:30.1 3
2017-01-10 07:19:31   2017-01-10 07:19:31   5
2017-01-10 07:19:32   2017-01-10 07:19:32   3
2017-01-10 07:19:33.1 2017-01-10 07:19:37.1 5

Logically, you can think of this as a simple DENSE_RANK() with a PREORDER BY: you want the DENSE_RANK of all the items that are ranked together, ordered by the dates. You just have to deal with the pesky fact that at each change in the date, DENSE_RANK will increment. You do that by using the expression shown above. Imagine you had this syntax: DENSE_RANK() OVER (PREORDER BY date, ORDER BY id_type), where the PREORDER is excluded from the ranking calculation and only the ORDER BY is counted.

Note that it's important to GROUP BY both the generated Seq column as well as the id_type column. Seq is NOT unique by itself, there can be overlaps--you must also group by id_type.
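The sample data happens not to produce such an overlap, but a shorter sequence shows one (a Python model, for illustration): with id_type values 1, 2, 1, 2, two islands of different types end up sharing Seq = 1:

```python
# Seq = overall row number minus per-type row number
ids = [1, 2, 1, 2]

per_type, seqs = {}, []
for rn, idt in enumerate(ids, start=1):
    per_type[idt] = per_type.get(idt, 0) + 1
    seqs.append(rn - per_type[idt])
```

Here seqs is [0, 1, 1, 2]: the second and third rows share Seq = 1 despite different id_type, so grouping by Seq alone would merge them; grouping by (id_type, Seq) keeps the four one-row islands apart.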

For further reading on this topic:

That first link gives you some code you can use if you wanted the begin or end date to be the same as the previous or next period's end/begin date (so there are no gaps). Plus other versions that could assist you in your query. Though they have to be translated from SQL Server syntax...

answered Mar 9, 2017 at 0:02

On Postgres 8.4 you can use a recursive CTE.

How it works

The recursive query adds a level for each different id_type, by selecting dates one by one in descending order.

         date          | id_type | lv
-----------------------+---------+----
 2017-01-10 07:19:21.0 |       3 |  8
 2017-01-10 07:19:22.0 |       3 |  8
 2017-01-10 07:19:23.1 |       3 |  8
 2017-01-10 07:19:24.1 |       3 |  8
 2017-01-10 07:19:25.0 |       3 |  8
 2017-01-10 07:19:26.0 |       5 |  7
 2017-01-10 07:19:27.1 |       3 |  6
 2017-01-10 07:19:28.0 |       5 |  5
 2017-01-10 07:19:29.0 |       5 |  5
 2017-01-10 07:19:30.1 |       3 |  4
 2017-01-10 07:19:31.0 |       5 |  3
 2017-01-10 07:19:32.0 |       3 |  2
 2017-01-10 07:19:33.1 |       5 |  1
 2017-01-10 07:19:35.0 |       5 |  1
 2017-01-10 07:19:36.1 |       5 |  1
 2017-01-10 07:19:37.1 |       5 |  1

Then use MAX(date), MIN(date) grouping by level, id_type to get the desired result.

with RECURSIVE rdates as 
(
 (select date, id_type, 1 lv 
 from yourTable
 order by date desc
 limit 1
 )
 union
 (select d.date, d.id_type,
 case when r.id_type = d.id_type 
 then r.lv 
 else r.lv + 1 
 end lv 
 from yourTable d
 inner join rdates r
 on d.date < r.date
 order by date desc
 limit 1)
)
select min(date) StartDate,
 max(date) EndDate,
 id_type
from rdates
group by lv, id_type
;
+---------------------+---------------------+---------+
| startdate | enddate | id_type |
+---------------------+---------------------+---------+
| 10.01.2017 07:19:21 | 10.01.2017 07:19:25 | 3 |
| 10.01.2017 07:19:26 | 10.01.2017 07:19:26 | 5 |
| 10.01.2017 07:19:27 | 10.01.2017 07:19:27 | 3 |
| 10.01.2017 07:19:28 | 10.01.2017 07:19:29 | 5 |
| 10.01.2017 07:19:30 | 10.01.2017 07:19:30 | 3 |
| 10.01.2017 07:19:31 | 10.01.2017 07:19:31 | 5 |
| 10.01.2017 07:19:32 | 10.01.2017 07:19:32 | 3 |
| 10.01.2017 07:19:33 | 10.01.2017 07:19:37 | 5 |
+---------------------+---------------------+---------+

Check it: http://rextester.com/WCOYFP6623
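The level assignment can be modeled in Python (an illustration of the logic, walking from the latest date backwards over the sample id_type sequence):

```python
# Walk rows in descending date order: keep the level while id_type repeats,
# increment it on each change, then flip back to ascending order.
ids = [3, 3, 3, 3, 3, 5, 3, 5, 5, 3, 5, 3, 5, 5, 5, 5]

lv, prev, levels = 1, None, []
for idt in reversed(ids):          # dates in descending order
    if prev is not None and idt != prev:
        lv += 1                    # id_type changed: next level
    levels.append(lv)
    prev = idt
levels.reverse()                   # back to ascending date order
```

The resulting levels match the table above, and grouping by (id_type, lv) gives the 8 islands.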

answered Mar 6, 2017 at 22:41

Here is another method, which is similar to Evan's and Erwin's in that it uses LAG to determine islands. It differs from those solutions in that it uses only one level of nesting, no grouping, and considerably more window functions:

SELECT
 id_type,
 date AS begin,
 COALESCE(
 LEAD(prev_date) OVER (ORDER BY date ASC),
 last_date
 ) AS end
FROM
 (
 SELECT
 id_type,
 date,
 LAG(date) OVER (ORDER BY date ASC) AS prev_date,
 MAX(date) OVER () AS last_date,
 CASE id_type
 WHEN LAG(id_type) OVER (ORDER BY date ASC)
 THEN 0
 ELSE 1
 END AS is_start
 FROM
 tmp
 ) AS derived
WHERE
 is_start = 1
ORDER BY
 date ASC
;

The is_start computed column in the nested SELECT marks the beginning of each island. Additionally, the nested SELECT exposes each row's previous date and the dataset's last date.

For rows that begin their respective islands, the previous date is effectively the previous island's ending date, and that is how the main SELECT uses it. It picks only the rows matching the is_start = 1 condition, and for each returned row it shows the row's own date as begin and the following row's prev_date as end. As the last row has no following row, LEAD(prev_date) returns NULL for it, and the COALESCE function substitutes the dataset's last date.
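A Python model of this logic (for illustration only; row indices stand in for timestamps): after the WHERE filter, LEAD(prev_date) sees only the start rows, so each island's end is the next start row's prev_date, or the dataset's last date for the final island:

```python
ids = [3, 3, 3, 3, 3, 5, 3, 5, 5, 3, 5, 3, 5, 5, 5, 5]
dates = list(range(len(ids)))
last_date = max(dates)

prev_date = [None] + dates[:-1]                       # LAG(date)
is_start = [1] + [1 if c != p else 0 for p, c in zip(ids, ids[1:])]

starts = [i for i, s in enumerate(is_start) if s]     # rows kept by is_start = 1
result = []
for k, i in enumerate(starts):
    # LEAD(prev_date) over the filtered rows = the next start row's prev_date
    end = prev_date[starts[k + 1]] if k + 1 < len(starts) else last_date
    result.append((ids[i], dates[i], end))            # (id_type, begin, end)
```

This produces the expected 8 islands with no grouping step at all.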

You can play with this solution at dbfiddle.

When introducing additional columns identifying the islands, you will probably want to introduce a PARTITION BY subclause to each window function's OVER clause. For instance, if you want to detect the islands within groups defined by a parent_id, the above query will probably need to look like this:

SELECT
 parent_id,
 id_type,
 date AS begin,
 COALESCE(
 LEAD(prev_date) OVER (PARTITION BY parent_id ORDER BY date ASC),
 last_date
 ) AS end
FROM
 (
 SELECT
 parent_id,
 id_type,
 date,
 LAG(date) OVER (PARTITION BY parent_id ORDER BY date ASC) AS prev_date,
 MAX(date) OVER (PARTITION BY parent_id) AS last_date,
 CASE id_type
 WHEN LAG(id_type) OVER (PARTITION BY parent_id ORDER BY date ASC)
 THEN 0
 ELSE 1
 END AS is_start
 FROM
 tmp
 ) AS derived
WHERE
 is_start = 1
ORDER BY
 date ASC
;

And if you decide to go with either Erwin's or Evan's solution, I believe a similar change will need to be added to it as well.

answered Mar 7, 2017 at 10:30

More out of academic interest than as a practical solution, you can also achieve this with a user-defined aggregate. Like the other solutions, this will work even on Postgres 8.4, but as others have commented, please upgrade if you can.

The aggregate handles null as if it is a different foo_type, so runs of nulls would be given the same grp — that may or may not be what you want.

create function grp_sfunc(integer[],integer) returns integer[] language sql as $$
 select array[1ドル[1]+(1ドル[2] is distinct from 2ドル or 1ドル[3]=0)::integer,2,1ドル];
$$;
create function grp_finalfunc(integer[]) returns integer language sql as $$
 select 1ドル[1];
$$;
create aggregate grp(integer)(
 sfunc = grp_sfunc
, stype = integer[]
, finalfunc = grp_finalfunc
, initcond = '{0,0,0}'
);
select min(foo_at) begin_at, max(foo_at) end_at, foo_type
from (select *, grp(foo_type) over (order by foo_at) from foo) z
group by grp, foo_type
order by 1;
begin_at              | end_at                | foo_type
:-------------------- | :-------------------- | -------:
2017-01-10 07:19:21   | 2017-01-10 07:19:25   |        3
2017-01-10 07:19:26   | 2017-01-10 07:19:26   |        5
2017-01-10 07:19:27.1 | 2017-01-10 07:19:27.1 |        3
2017-01-10 07:19:28   | 2017-01-10 07:19:29   |        5
2017-01-10 07:19:30.1 | 2017-01-10 07:19:30.1 |        3
2017-01-10 07:19:31   | 2017-01-10 07:19:31   |        5
2017-01-10 07:19:32   | 2017-01-10 07:19:32   |        3
2017-01-10 07:19:33.1 | 2017-01-10 07:19:37.1 |        5

dbfiddle here
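The aggregate's state transition can be mirrored in Python (a model, not the SQL itself; Python's != stands in for IS DISTINCT FROM, which matches for these non-NULL inputs): state = (grp, previous value, seen-a-row flag), bumping grp on the first row or whenever the value changes:

```python
def grp_sfunc(state, x):
    grp, prev, seen = state
    bump = 1 if (seen == 0 or prev != x) else 0   # bump grp on first row or change
    return (grp + bump, x, 1)

ids = [3, 3, 3, 3, 3, 5, 3, 5, 5, 3, 5, 3, 5, 5, 5, 5]
state, grps = (0, None, 0), []
for x in ids:
    state = grp_sfunc(state, x)
    grps.append(state[0])          # the finalfunc returns this first element
```

Each row gets a grp of 1 through 8, and grouping by (id_type, grp) yields the 8 islands.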

answered Mar 7, 2017 at 15:51

This can be done with a RECURSIVE CTE to pass the "begin time" from one row to the next, plus some extra (convenience) preparations.

This query returns the result you wish:

WITH RECURSIVE q AS
(
 SELECT
 id_type,
 "date",
 /* We compute next id_type for convenience, plus row_number */
 row_number() OVER (w) AS rn,
 lead(id_type) OVER (w) AS next_id_type
 FROM
 t
 WINDOW
 w AS (ORDER BY "date") 
)

After the preparation comes the recursive part:

, rec AS 
(
 /* Anchor */
 SELECT
 q.rn,
 q."date" AS "begin",
 /* Look also at the **next** row to find out whether we need to mark an end */
 case when q.id_type is distinct from q.next_id_type then q."date" END AS "end",
 q.id_type
 FROM
 q
 WHERE
 rn = 1
 UNION ALL
 /* Loop */
 SELECT
 q.rn,
 /* We keep copying 'begin' from one row to the next while type doesn't change */
 case when q.id_type = rec.id_type then rec.begin else q."date" end AS "begin",
 case when q.id_type is distinct from q.next_id_type then q."date" end AS "end",
 q.id_type
 FROM
 rec
 JOIN q ON q.rn = rec.rn+1
)
-- We filter the rows where "end" is not null, and project only needed columns
SELECT
 "begin", "end", id_type
FROM
 rec
WHERE
 "end" is not null ;

You can check this at http://rextester.com/POYM83542
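The carry logic can be modeled in Python (an illustration; indices stand in for dates): begin is copied forward while id_type repeats, and end is set only where the next id_type differs (or on the last row):

```python
ids = [3, 3, 3, 3, 3, 5, 3, 5, 5, 3, 5, 3, 5, 5, 5, 5]
dates = list(range(len(ids)))

rows, begin = [], None
for i, idt in enumerate(ids):
    if i == 0 or ids[i - 1] != idt:
        begin = dates[i]                              # new group starts here
    next_idt = ids[i + 1] if i + 1 < len(ids) else None
    end = dates[i] if idt != next_idt else None       # mark end on a change
    rows.append((begin, end, idt))

result = [r for r in rows if r[1] is not None]        # WHERE "end" IS NOT NULL
```

The filtered rows are exactly the 8 (begin, end, id_type) islands.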

This method doesn't scale well. For an 8,641-row table it takes 7 s; for a table twice that size, 28 s. A few more samples show execution times growing like O(n^2).

Evan Carroll's method takes less than 1 s (i.e.: go for it!) and looks like O(n). Recursive queries are absolutely inefficient here and should be considered a last resort.

answered Mar 6, 2017 at 22:40
