Nested aggregate functions over different dimensions

Question 1

I have a table where I periodically record the throughput of all network interfaces on a set of servers:

create table net (
 ts timestamptz not null,
 host text not null,
 interface text not null,
 recv_bytes bigint not null
);
create index on net (ts);
insert into net (ts, host, interface, recv_bytes) values
 ('2017-01-01 00:00:00+00', 'a', 'eth0', 500),
 ('2017-01-01 00:00:00+00', 'b', 'eth1', 2000),
 ('2017-01-01 00:00:01+00', 'b', 'eth0', 1000), -- measurements arrive with some jitter, +1s here
 ('2017-01-01 00:00:02+00', 'c', 'eth0', 100), -- only present in this interval
 ('2017-01-01 00:00:04+00', 'b', 'eth1', 1900),
 ('2017-01-01 00:00:05+00', 'a', 'eth0', 550),
 ('2017-01-01 00:00:05+00', 'b', 'eth0', 1200),
 ('2017-01-01 00:00:10+00', 'a', 'eth0', 600),
 ('2017-01-01 00:00:10+00', 'b', 'eth0', 1500),
 ('2017-01-01 00:00:11+00', 'b', 'eth1', 1900),
 ('2017-01-01 00:00:15+00', 'a', 'eth0', 600),
 ('2017-01-01 00:00:15+00', 'b', 'eth1', 1400),
 ('2017-01-01 00:00:16+00', 'b', 'eth0', 1400),
 ('2017-01-01 00:00:16+00', 'b', 'eth1', 1700); -- (b,eth1) appears 3 times in this interval

I would like to find the total throughput across all interfaces for a given time range, averaged over 10 second intervals. If a (host, interface) pair is missing data for a given interval, it can simply be omitted. This is what I came up with:

select ts_interval, host, sum(recv_bytes) as recv_bytes
from (
 select
 to_timestamp(floor(extract(epoch from ts) / 10) * 10) as ts_interval,
 host,
 interface,
 avg(recv_bytes) as recv_bytes
 from net
 group by ts_interval, host, interface
) avg_net
where ts_interval >= '2017-01-01 00:00:00+00' and ts_interval < '2017-01-01 00:00:20+00'
group by ts_interval, host
order by ts_interval, host;

Result:

 ts_interval | host | recv_bytes
------------------------+------+-----------------------
 2017年01月01日 00:00:00+00 | a | 525.0000000000000000
 2017年01月01日 00:00:00+00 | b | 3050.0000000000000000
 2017年01月01日 00:00:00+00 | c | 100.0000000000000000
 2017年01月01日 00:00:10+00 | a | 600.0000000000000000
 2017年01月01日 00:00:10+00 | b | 3116.6666666666666667

This query seems overly verbose. The essence of that I'm trying to do is to sum over one dimension and average over the other. In pseudo-SQL:

select
 to_timestamp(floor(extract(epoch from ts) / 10) * 10) as ts_interval,
 host,
 sum(avg(recv_bytes OVER ts_interval) OVER host) as recv_bytes
from net
where ts >= '2017-01-01 00:00:00+00' and ts < '2017-01-01 00:00:20+00'
group by ts_interval, host
order by ts_interval, host;

Is there a more concise way to do this?

Question 2

Very clear question with just the right amount of useful information. (Only your version of Postgres is missing.)

Question 3

@Erwin Whoops, I'm on Postgres 9.6. I added the version as a tag just now.

Question 4

Query

You can use generate_series() to generate a time raster to join to:

SELECT g AS ts_interval, host, sum(recv_bytes) AS recv_bytes
FROM (
 SELECT g, host, interface, avg(recv_bytes) AS recv_bytes
 FROM generate_series(timestamptz '2017-01-01 00:00:00+00'
 , timestamptz '2017-01-01 00:00:10+00' -- 10 - only lower bound
 , interval '10 sec') g
 JOIN net n ON ts >= g
 AND ts < g + interval '10 sec'
 GROUP BY g, host, interface
 ) sub
GROUP BY g, host
ORDER BY g, host;

Same result. Not less verbose, but much faster if you have an index on (ts).

Then again, the crucial point is to have "sargable" predicates, which you also get without generate_series() this way:

SELECT ts_interval, host, sum(recv_bytes) as recv_bytes
FROM (
 SELECT to_timestamp(trunc(extract(epoch from ts) / 10) * 10) AS ts_interval
 , host, interface, avg(recv_bytes) as recv_bytes
 FROM net
 WHERE ts >= '2017-01-01 00:00:00+00'
 AND ts < '2017-01-01 00:00:20+00' -- just make sure to match bounds
 GROUP BY 1, 2, 3
 ) avg_net
GROUP BY 1, 2
ORDER BY 1, 2;

Minor point: With exclusively positive numbers, trunc() is equivalent to floor() and a bit faster.

I am using ordinal numbers, since you asked for "less verbose" code. But that's not the core of the question here ...

Core question

You can run a window function over an aggregate function in one query level (though it's not typically faster than using a subquery).

But not the other way round: you cannot run an aggregate function over a window function in one query level. Just not possible. Window functions happen after aggregate functions in SQL.
Related:

Question 5

I've never heard the term "time raster." As someone who works in GIS, I'm used to "raster" specifically referring to pixel based data formats. Everything on the first page of Google for "time raster" seems to be GIS related, as well. Do you have any links explaining what you mean by the term?

Question 6

@jpmc26: That's just the wording I came up with, not a technical term. Maybe "time grid", "time series" or "time pattern" are less distracting.

Question 7

If there is limited amount of interfaces then it could be

select
 to_timestamp(floor(extract(epoch from ts) / 10) * 10) as ts_interval,
 host,
 coalesce(avg(recv_bytes) filter (where interface = 'eth0'), 0) + 
 coalesce(avg(recv_bytes) filter (where interface = 'eth1'), 0) +
 coalesce(avg(recv_bytes) filter (where interface = 'wlan0'), 0) -- etc
from
 net
group by 1,2
order by 1,2;

But I'm not sure that it is really better then the your original query.

score 3 · Accepted Answer · 2017-05-08 23:56:46Z

Query

You can use generate_series() to generate a time raster to join to:

SELECT g AS ts_interval, host, sum(recv_bytes) AS recv_bytes
FROM (
 SELECT g, host, interface, avg(recv_bytes) AS recv_bytes
 FROM generate_series(timestamptz '2017-01-01 00:00:00+00'
 , timestamptz '2017-01-01 00:00:10+00' -- 10 - only lower bound
 , interval '10 sec') g
 JOIN net n ON ts >= g
 AND ts < g + interval '10 sec'
 GROUP BY g, host, interface
 ) sub
GROUP BY g, host
ORDER BY g, host;

Same result. Not less verbose, but much faster if you have an index on (ts).

Then again, the crucial point is to have "sargable" predicates, which you also get without generate_series() this way:

SELECT ts_interval, host, sum(recv_bytes) as recv_bytes
FROM (
 SELECT to_timestamp(trunc(extract(epoch from ts) / 10) * 10) AS ts_interval
 , host, interface, avg(recv_bytes) as recv_bytes
 FROM net
 WHERE ts >= '2017-01-01 00:00:00+00'
 AND ts < '2017-01-01 00:00:20+00' -- just make sure to match bounds
 GROUP BY 1, 2, 3
 ) avg_net
GROUP BY 1, 2
ORDER BY 1, 2;

Minor point: With exclusively positive numbers, trunc() is equivalent to floor() and a bit faster.

I am using ordinal numbers, since you asked for "less verbose" code. But that's not the core of the question here ...

Core question

You can run a window function over an aggregate function in one query level (though it's not typically faster than using a subquery).

But not the other way round: you cannot run an aggregate function over a window function in one query level. Just not possible. Window functions happen after aggregate functions in SQL.
Related:

I've never heard the term "time raster." As someone who works in GIS, I'm used to "raster" specifically referring to pixel based data formats. Everything on the first page of Google for "time raster" seems to be GIS related, as well. Do you have any links explaining what you mean by the term?
@jpmc26: That's just the wording I came up with, not a technical term. Maybe "time grid", "time series" or "time pattern" are less distracting.

Stack Exchange Network

Nested aggregate functions over different dimensions

2 Answers 2

Query

Core question

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions