I have data like following:
created_at | status ---------------------+------------- 2016年04月05日 1:27:15 | info 2016年04月05日 3:27:15 | info 2016年04月05日 5:27:15 | warn 2016年04月05日 10:27:15 | info 2016年04月05日 11:27:15 | warn
With this data, I want to convert like as follows:
status | 2016年04月05日 1:00:00 | 2016年04月05日 4:00:00 | 2016年04月05日 8:00:00 | 2016年04月05日 12:00:00 ---------+--------------------+--------------------+--------------------+------------------- info | 1 | 1 | 0 | 1 warn | 0 | 0 | 1 | 1
Can anyone suggest the best way to do this?
2 Answers 2
Assuming 2016年04月05日 0:27:15
instead of (削除) in the underlying table, the question would make more sense to me:2016年04月05日 1:27:15
(削除ここまで)
CREATE TABLE tbl (created_at timestamp, status text);
INSERT INTO tbl VALUES
('2016-04-05 00:27:15', 'info')
, ('2016-04-05 03:27:15', 'info')
, ('2016-04-05 05:27:15', 'warn')
, ('2016-04-05 10:27:15', 'info')
, ('2016-04-05 11:27:15', 'warn');
The logic would be to count events that happened up to and excluding the next bound. This fits the often overlooked function width_bucket()
perfectly. To be precise, it requires the variant with arbitrary bounds (since there is no regular pattern in the OP's bounds) introduced with Postgres 9.5. Explanation straight from the manual:
width_bucket(operand anyelement, thresholds anyarray)
return the bucket number to which
operand
would be assigned given an array listing the lower bounds of the buckets; returns0
for an input less than the first lower bound; thethresholds
array must be sorted, smallest first, or unexpected results will be obtained
For regular buckets you can use another variant that's available in Postgres 9.1 as well.
Combine it with crosstab()
re-using the same bounds as column names (the rest of the query works with Postgres 9.1):
SELECT * FROM crosstab(
$$SELECT status
, width_bucket(created_at, '{2016年04月05日 01:00
, 2016年04月05日 04:00
, 2016年04月05日 08:00
, 2016年04月05日 12:00}'::timestamp[])
, count(*)::int
FROM tbl
WHERE created_at < '2016-04-05 12:00' -- exclude later rows
GROUP BY 1, 2
ORDER BY 1, 2$$
, 'SELECT generate_series(0,3)'
) AS t(status text, "2016-04-05 01:00" int
, "2016-04-05 04:00" int
, "2016-04-05 08:00" int
, "2016-04-05 12:00" int);
Result:
status | 2016年04月05日 01:00 | 2016年04月05日 04:00 | 2016年04月05日 08:00 | 2016年04月05日 12:00
--------+------------------+------------------+------------------+------------------
info | 1 | 1 | | 1
warn | | | 1 | 1
The second crosstab parameter ('SELECT generate_series(0,3)'
) is a query string when executed returning one row for every target column. Every value not found on either side - not in the raw data or not generated by the 2nd parameter - is simply ignored.
Basics for crosstab()
:
Replace NULL with 0
If you need 0
instead of NULL
in the result, fix with COALESCE()
, but that's merely a cosmetic problem:
SELECT status
, COALESCE(t0, 0) AS "2016-04-05 01:00"
, COALESCE(t1, 0) AS "2016-04-05 04:00"
, COALESCE(t2, 0) AS "2016-04-05 08:00"
, COALESCE(t3, 0) AS "2016-04-05 12:00"
FROM crosstab(
$$SELECT status
, width_bucket(created_at, '{2016年04月05日 01:00
, 2016年04月05日 04:00
, 2016年04月05日 08:00
, 2016年04月05日 12:00}'::timestamp[])
, count(*)::int
FROM tbl
WHERE created_at < '2016-04-05 12:00'
GROUP BY 1, 2
ORDER BY 1, 2$$
, 'SELECT generate_series(0,3)'
) AS t(status text, t0 int, t1 int, t2 int, t3 int);
Result:
status | 2016年04月05日 01:00 | 2016年04月05日 04:00 | 2016年04月05日 08:00 | 2016年04月05日 12:00
--------+------------------+------------------+------------------+------------------
info | 1 | 1 | 0 | 1
warn | 0 | 0 | 1 | 1
Adding totals
To add totals per status
use the new GROUPING SETS
in Postgres 9.5+
SELECT status
, COALESCE(t0, 0) AS "2016-04-05 01:00"
, COALESCE(t1, 0) AS "2016-04-05 04:00"
, COALESCE(t2, 0) AS "2016-04-05 08:00"
, COALESCE(t3, 0) AS "2016-04-05 12:00"
, COALESCE(t4, 0) AS total
FROM crosstab(
$$SELECT status, COALESCE(slot, -1), ct -- special slot for totals
FROM (
SELECT status
, width_bucket(created_at, '{2016年04月05日 01:00
, 2016年04月05日 04:00
, 2016年04月05日 08:00
, 2016年04月05日 12:00}'::timestamp[]) AS slot
, count(*)::int AS ct
FROM tbl
WHERE created_at < '2016-04-05 12:00'
GROUP BY GROUPING SETS ((1, 2), 1) -- add totals per status
ORDER BY 1, 2
) sub$$
, 'VALUES (0), (1), (2), (3), (-1)' -- switched to VALUES for more sophisticated series
) AS t(status text, t0 int, t1 int, t2 int, t3 int, t4 int);
Result like above, plus:
... | total
... -+-------
... | 3
... | 2
Note that total
includes all rows not excluded before aggregation, even if filtered by crosstab()
.
This is in reply to @Vérace's request in the comments rather than to the unclear question.
-
Very nice - +1 (as usual!). Just one small question - what is the 'VALUES(0)...(-1)' expression doing? Especially the (-1) bit - I've got my head around everything else.Vérace– Vérace2016年04月11日 03:43:26 +00:00Commented Apr 11, 2016 at 3:43
-
@Vérace:
-1
is just the arbitrary value I assign to the "total" slot withCOALESCE(slot, -1)
. ( I didn't add column aliases as those are ignored bycrosstab()
anyway.) The totals are added last byGROUPING SETS
withNULL
asslot
value. Try the innerSELECT
first to see ...Erwin Brandstetter– Erwin Brandstetter2016年04月11日 03:47:52 +00:00Commented Apr 11, 2016 at 3:47
To answer this, I did the following. (I found the following three threads helpful - 1, 2 and 3. I also found the documentation on generate_series
and CROSSTAB
here and here respectively to be useful). This should work on 9.1 - not tested, but the documentation indicates that no post-9.1 stuff is being used.
Created a table:
ntest=# create table pv_tab(created_at timestamp, status varchar(10));
and populated it.
INSERT INTO pv_tab VALUES('2016-04-05 01:27:15', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 03:27:15', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 05:27:15', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 10:27:15', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 11:27:15', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 00:27:15', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 00:24:15', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 00:24:13', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 00:24:13', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 01:24:13', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 01:24:13', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 01:12:13', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 01:12:22', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 02:05:45', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 02:34:45', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 10:34:45', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 10:35:45', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 10:36:45', 'warn');
INSERT INTO pv_tab VALUES('2016-04-05 10:36:45', 'info');
INSERT INTO pv_tab VALUES('2016-04-05 10:36:34', 'info');
(20 rows)
The (correct) result of my query is:
stat slot1 slot2 slot3 slot4 slot5 slot6 Total
---- ----- ----- ----- ----- ----- ----- -----
info 6 2 0 3 0 0 11
warn 3 1 1 4 0 0 9
The working query is:
SELECT * FROM CROSSTAB
(
'
WITH time_slots AS
(
SELECT
status,
CASE
-- Here I put the "created_at" values into "buckets" - it would
-- not be desirable to have too many of these buckets - certainly
-- any more than 12 would make the `SQL` and result unwieldy!
-- I recommend that you create 2hr slots - 00:00 - 02:00, &c.
-- This `CTE` splits the times into the various slots sample slots
-- 1-4 - you can, of course, have more but it makes the `SQL` and
-- the answer more messy. Here, I;ve deliberately only used 4
-- out of 6 in order to illustrate dealing with sparse data in
-- the result. (I used the OP;s initial slots - easy to change).
WHEN created_at < ''2016-04-05 02:00'' THEN 1
WHEN created_at >= ''2016-04-05 02:00'' AND created_at < ''2016-04-05 04:00'' THEN 2
WHEN created_at >= ''2016-04-05 04:00'' AND created_at < ''2016-04-05 08:00'' THEN 3
WHEN created_at >= ''2016-04-05 08:00'' AND created_at < ''2016-04-05 12:00'' THEN 4
END AS time_slot,
COUNT(status) AS stat_count
FROM pv_tab
GROUP BY status, time_slot
ORDER BY status, time_slot
),
statuae AS
-- Get all statuses. Hardly necessary when there are only two, but
-- could be an issue later if more values are required ("unknown".. &c.).
(
SELECT DISTINCT(status) AS stati
FROM pv_tab
),
all_slots (slots) AS
-- This `CTE` is necessary to perform a cross-join between statuses
-- and slots. This is because the `CROSSTAB` table function doesn;t
-- appear to play well with `NULL`s - see my question to Erwin
-- Brandstetter in comments.
(
SELECT generate_series(1, 6) -- six (should be) 2 hour slots. In any case, it is arbitrary!
),
stat_slots AS
-- Here the statuses slots are cross-joined - i.e. all slots with all statuses.
(
SELECT statuae.stati, all_slots.slots
FROM statuae, all_slots
),
individual_stati AS
-- `Left-join` the complete status/slot table with the actual slots in
-- the sample table. NULL counts are `COALESCE`ed into 0 - necessary, otherwise
-- `NULL`s "back up" the result and leaves blanks in the right-most
-- columns - and the totals appear in what should be slots.
(
SELECT ss.stati AS status, ss.slots AS time_slot, COALESCE(ts.stat_count, 0) AS counts
FROM stat_slots ss
LEFT JOIN time_slots ts
ON ss.stati = ts.status AND ss.slots = ts.time_slot
ORDER BY 1, 2
),
total_stati AS
-- This is just pure showing off :-). I;m using this `CTE` to add
-- a totals field to the query. Not asked for by the OP - can be
-- ripped out! I got the idea for this from the 3rd link (top of post).
(
SELECT status, 7 AS time_slot, count(status) AS counts -- 7 - an exta slot for totals
FROM pv_tab
GROUP BY status
)
-- Final query bringing it all together - nice, simple and elegant. :-)
SELECT status, time_slot, counts FROM individual_stati
UNION
SELECT status, time_slot, counts FROM total_stati
ORDER BY 1, 2
'
) AS My_Tab("stat" varchar(10), "slot1" bigint, "slot2" bigint, "slot3" bigint, "slot4" bigint, "slot5" bigint, "slot6" bigint, "Total" bigint);
-
1
NULL counts are COALESCEed into 0 - necessary, otherwise NULLs "back up" the result and leaves blanks in the right-most columns
- that's what the 2nd parameter tocrosstab()
is for, explained in detail in my answer you referencedErwin Brandstetter– Erwin Brandstetter2016年04月11日 03:36:14 +00:00Commented Apr 11, 2016 at 3:36
1:00, 4:00, 8:00, 12:00
...? There is no obvious pattern behind your chosen hours. And no consistent way of counting, either.1:27
->1:00
, but5:27
->8:00
? Please fix your question to make sense, explain your case and put a minimum of effort into it. "psql" is the PostgreSQL interactive terminal and not an abbreviation for Postgres, btw.