11

I have data like following:

 created_at | status 
---------------------+-------------
 2016年04月05日 1:27:15 | info
 2016年04月05日 3:27:15 | info
 2016年04月05日 5:27:15 | warn
 2016年04月05日 10:27:15 | info
 2016年04月05日 11:27:15 | warn

With this data, I want to convert like as follows:

 status | 2016年04月05日 1:00:00 | 2016年04月05日 4:00:00 | 2016年04月05日 8:00:00 | 2016年04月05日 12:00:00
---------+--------------------+--------------------+--------------------+-------------------
 info | 1 | 1 | 0 | 1
 warn | 0 | 0 | 1 | 1 

Can anyone suggest the best way to do this?

Erwin Brandstetter
186k28 gold badges463 silver badges636 bronze badges
asked Apr 10, 2016 at 7:41
2
  • 1
    1:00, 4:00, 8:00, 12:00 ...? There is no obvious pattern behind your chosen hours. And no consistent way of counting, either. 1:27 -> 1:00, but 5:27 -> 8:00? Please fix your question to make sense, explain your case and put a minimum of effort into it. "psql" is the PostgreSQL interactive terminal and not an abbreviation for Postgres, btw. Commented Apr 11, 2016 at 0:26
  • @ErwinBrandstetter doesn't matter - I've answered. Essentially, the timeslots can be arbitrary - using code like this "CASE WHEN created_at < ''2016年04月05日 02:00'' THEN 1 WHEN created_at >= ''2016年04月05日 02:00'' AND created_at < ''2016年04月05日 04:00'' THEN 2. Suggesting to OP 2 hour tranches, but it's up to them. I have 1 question to truly polish my answer though, here you say "second parameter can be any query that returns one row per attribute" - what happens if no such row exists? See my sample data! Commented Apr 11, 2016 at 1:11

2 Answers 2

8

Assuming 2016年04月05日 0:27:15 instead of (削除) 2016年04月05日 1:27:15 (削除ここまで) in the underlying table, the question would make more sense to me:

CREATE TABLE tbl (created_at timestamp, status text);
INSERT INTO tbl VALUES
 ('2016-04-05 00:27:15', 'info')
, ('2016-04-05 03:27:15', 'info')
, ('2016-04-05 05:27:15', 'warn')
, ('2016-04-05 10:27:15', 'info')
, ('2016-04-05 11:27:15', 'warn');

The logic would be to count events that happened up to and excluding the next bound. This fits the often overlooked function width_bucket() perfectly. To be precise, it requires the variant with arbitrary bounds (since there is no regular pattern in the OP's bounds) introduced with Postgres 9.5. Explanation straight from the manual:

width_bucket(operand anyelement, thresholds anyarray)

return the bucket number to which operand would be assigned given an array listing the lower bounds of the buckets; returns 0 for an input less than the first lower bound; the thresholds array must be sorted, smallest first, or unexpected results will be obtained

For regular buckets you can use another variant that's available in Postgres 9.1 as well.
Combine it with crosstab() re-using the same bounds as column names (the rest of the query works with Postgres 9.1):

SELECT * FROM crosstab(
 $$SELECT status
 , width_bucket(created_at, '{2016年04月05日 01:00
 , 2016年04月05日 04:00
 , 2016年04月05日 08:00
 , 2016年04月05日 12:00}'::timestamp[])
 , count(*)::int
 FROM tbl
 WHERE created_at < '2016-04-05 12:00' -- exclude later rows
 GROUP BY 1, 2
 ORDER BY 1, 2$$
, 'SELECT generate_series(0,3)'
 ) AS t(status text, "2016-04-05 01:00" int
 , "2016-04-05 04:00" int
 , "2016-04-05 08:00" int
 , "2016-04-05 12:00" int);

Result:

 status | 2016年04月05日 01:00 | 2016年04月05日 04:00 | 2016年04月05日 08:00 | 2016年04月05日 12:00
--------+------------------+------------------+------------------+------------------
 info | 1 | 1 | | 1
 warn | | | 1 | 1

The second crosstab parameter ('SELECT generate_series(0,3)') is a query string when executed returning one row for every target column. Every value not found on either side - not in the raw data or not generated by the 2nd parameter - is simply ignored.

Basics for crosstab():

Replace NULL with 0

If you need 0 instead of NULL in the result, fix with COALESCE(), but that's merely a cosmetic problem:

SELECT status
 , COALESCE(t0, 0) AS "2016-04-05 01:00"
 , COALESCE(t1, 0) AS "2016-04-05 04:00"
 , COALESCE(t2, 0) AS "2016-04-05 08:00"
 , COALESCE(t3, 0) AS "2016-04-05 12:00"
FROM crosstab(
 $$SELECT status
 , width_bucket(created_at, '{2016年04月05日 01:00
 , 2016年04月05日 04:00
 , 2016年04月05日 08:00
 , 2016年04月05日 12:00}'::timestamp[])
 , count(*)::int
 FROM tbl
 WHERE created_at < '2016-04-05 12:00'
 GROUP BY 1, 2
 ORDER BY 1, 2$$
, 'SELECT generate_series(0,3)'
 ) AS t(status text, t0 int, t1 int, t2 int, t3 int);

Result:

 status | 2016年04月05日 01:00 | 2016年04月05日 04:00 | 2016年04月05日 08:00 | 2016年04月05日 12:00
--------+------------------+------------------+------------------+------------------
 info | 1 | 1 | 0 | 1
 warn | 0 | 0 | 1 | 1

Adding totals

To add totals per status use the new GROUPING SETS in Postgres 9.5+

SELECT status
 , COALESCE(t0, 0) AS "2016-04-05 01:00"
 , COALESCE(t1, 0) AS "2016-04-05 04:00"
 , COALESCE(t2, 0) AS "2016-04-05 08:00"
 , COALESCE(t3, 0) AS "2016-04-05 12:00"
 , COALESCE(t4, 0) AS total
FROM crosstab(
 $$SELECT status, COALESCE(slot, -1), ct -- special slot for totals
 FROM (
 SELECT status
 , width_bucket(created_at, '{2016年04月05日 01:00
 , 2016年04月05日 04:00
 , 2016年04月05日 08:00
 , 2016年04月05日 12:00}'::timestamp[]) AS slot
 , count(*)::int AS ct
 FROM tbl
 WHERE created_at < '2016-04-05 12:00'
 GROUP BY GROUPING SETS ((1, 2), 1) -- add totals per status
 ORDER BY 1, 2
 ) sub$$
 , 'VALUES (0), (1), (2), (3), (-1)' -- switched to VALUES for more sophisticated series
 ) AS t(status text, t0 int, t1 int, t2 int, t3 int, t4 int);

Result like above, plus:

... | total
... -+-------
... | 3
... | 2

Note that total includes all rows not excluded before aggregation, even if filtered by crosstab().

This is in reply to @Vérace's request in the comments rather than to the unclear question.

answered Apr 11, 2016 at 2:27
2
  • Very nice - +1 (as usual!). Just one small question - what is the 'VALUES(0)...(-1)' expression doing? Especially the (-1) bit - I've got my head around everything else. Commented Apr 11, 2016 at 3:43
  • @Vérace: -1 is just the arbitrary value I assign to the "total" slot with COALESCE(slot, -1). ( I didn't add column aliases as those are ignored by crosstab() anyway.) The totals are added last by GROUPING SETS with NULL as slot value. Try the inner SELECT first to see ... Commented Apr 11, 2016 at 3:47
5

To answer this, I did the following. (I found the following three threads helpful - 1, 2 and 3. I also found the documentation on generate_series and CROSSTAB here and here respectively to be useful). This should work on 9.1 - not tested, but the documentation indicates that no post-9.1 stuff is being used.

Created a table:

ntest=# create table pv_tab(created_at timestamp, status varchar(10));

and populated it.

INSERT INTO pv_tab VALUES('2016-04-05 01:27:15', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 03:27:15', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 05:27:15', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 10:27:15', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 11:27:15', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 00:27:15', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 00:24:15', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 00:24:13', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 00:24:13', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 01:24:13', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 01:24:13', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 01:12:13', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 01:12:22', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 02:05:45', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 02:34:45', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 10:34:45', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 10:35:45', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 10:36:45', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 10:36:45', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 10:36:34', 'info');
(20 rows)

The (correct) result of my query is:

stat slot1 slot2 slot3 slot4 slot5 slot6 Total
---- ----- ----- ----- ----- ----- ----- -----
info 6 2 0 3 0 0 11
warn 3 1 1 4 0 0 9

The working query is:

SELECT * FROM CROSSTAB
(
'
WITH time_slots AS
(
 SELECT
 status,
 CASE
 -- Here I put the "created_at" values into "buckets" - it would
 -- not be desirable to have too many of these buckets - certainly
 -- any more than 12 would make the `SQL` and result unwieldy!
 -- I recommend that you create 2hr slots - 00:00 - 02:00, &c.
 -- This `CTE` splits the times into the various slots sample slots
 -- 1-4 - you can, of course, have more but it makes the `SQL` and 
 -- the answer more messy. Here, I;ve deliberately only used 4
 -- out of 6 in order to illustrate dealing with sparse data in 
 -- the result. (I used the OP;s initial slots - easy to change).
 WHEN created_at < ''2016-04-05 02:00'' THEN 1
 WHEN created_at >= ''2016-04-05 02:00'' AND created_at < ''2016-04-05 04:00'' THEN 2
 WHEN created_at >= ''2016-04-05 04:00'' AND created_at < ''2016-04-05 08:00'' THEN 3
 WHEN created_at >= ''2016-04-05 08:00'' AND created_at < ''2016-04-05 12:00'' THEN 4
 END AS time_slot,
 COUNT(status) AS stat_count
 FROM pv_tab
 GROUP BY status, time_slot
 ORDER BY status, time_slot
),
statuae AS
 -- Get all statuses. Hardly necessary when there are only two, but
 -- could be an issue later if more values are required ("unknown".. &c.).
(
 SELECT DISTINCT(status) AS stati
 FROM pv_tab
),
all_slots (slots) AS
 -- This `CTE` is necessary to perform a cross-join between statuses 
 -- and slots. This is because the `CROSSTAB` table function doesn;t 
 -- appear to play well with `NULL`s - see my question to Erwin
 -- Brandstetter in comments.
(
 SELECT generate_series(1, 6) -- six (should be) 2 hour slots. In any case, it is arbitrary!
),
stat_slots AS
 -- Here the statuses slots are cross-joined - i.e. all slots with all statuses.
(
 SELECT statuae.stati, all_slots.slots
 FROM statuae, all_slots
),
individual_stati AS
 -- `Left-join` the complete status/slot table with the actual slots in
 -- the sample table. NULL counts are `COALESCE`ed into 0 - necessary, otherwise
 -- `NULL`s "back up" the result and leaves blanks in the right-most 
 -- columns - and the totals appear in what should be slots. 
(
 SELECT ss.stati AS status, ss.slots AS time_slot, COALESCE(ts.stat_count, 0) AS counts
 FROM stat_slots ss
 LEFT JOIN time_slots ts
 ON ss.stati = ts.status AND ss.slots = ts.time_slot
 ORDER BY 1, 2
),
total_stati AS
 -- This is just pure showing off :-). I;m using this `CTE` to add
 -- a totals field to the query. Not asked for by the OP - can be
 -- ripped out! I got the idea for this from the 3rd link (top of post).
(
 SELECT status, 7 AS time_slot, count(status) AS counts -- 7 - an exta slot for totals
 FROM pv_tab
 GROUP BY status
)
-- Final query bringing it all together - nice, simple and elegant. :-)
SELECT status, time_slot, counts FROM individual_stati
UNION
SELECT status, time_slot, counts FROM total_stati
ORDER BY 1, 2
'
) AS My_Tab("stat" varchar(10), "slot1" bigint, "slot2" bigint, "slot3" bigint, "slot4" bigint, "slot5" bigint, "slot6" bigint, "Total" bigint);
answered Apr 11, 2016 at 1:19
1
  • 1
    NULL counts are COALESCEed into 0 - necessary, otherwise NULLs "back up" the result and leaves blanks in the right-most columns - that's what the 2nd parameter to crosstab() is for, explained in detail in my answer you referenced Commented Apr 11, 2016 at 3:36

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.