Time series index range contains operator (reverse)

Question 1

Hi I am trying to optimise the a timestamp range contains <@ query for Postgres 12

I have done a bit of reading postgres documentation and discovered only GiST and SP-GiST indexes support this operator. However, I can't add one of these ( I tink I would need to add one to the heartrate table - see Schema below, but that is not a range type...).

My question is similar to this question and this one which also indicate I will need a GiST index. However, they are the other way around, eg the column they have a single timestamp and want to return from a table of tsranges all records that are contained. I have a table of timestamps and want to join it to a table of tsranges

For a bit of context for my schema, I have a collection of heartrates in the real dataset sampled ~1/3seconds, and a list of songs I have listen to, and when. I would like to query things like

avg(heartrate) for a particular track and artist
avg(heartrate) for a particular artist
etc.

Schema

create table heartrate (
 "time" timestamp primary key ,
 value float
)
;
CREATE INDEX ON heartrate ("time", value);
-- CREATE INDEX ON heartrate USING GIST ("time", value); can't do as "time" is not a range column.
-- one gets the following error: 
--- ERROR: data type timestamp without time zone has no default operator class for access method "gist" Hint: You must specify an operator class for the index or define a default operator class for the data type.
create table song_play(
 track TEXT NOT NULL,
 artist TEXT NOT NULL,
 play tsrange not null
)
;
CREATE INDEX ON song_play(track, artist);
INSERT INTO heartrate("time", value)
SELECT d, 60+60*random()
FROM generate_series('2015-01-01 00:00:00'::timestamp, '2020-01-01 00:00:00'::timestamp, '5 min'::interval) d
;
INSERT INTO song_play(track,artist, play)
SELECT case when random() > 0.5 then 'a' when random() > 0.5 then 'b' else 'c' end 
, case when random() > 0.5 then 'a' when random() > 0.5 then 'b' else 'c' end
, tsrange(d, d+ (((random()*3+1)::text|| 'min')::interval))
FROM generate_series('2015-01-01 00:00:00'::timestamp, '2020-01-01 00:00:00'::timestamp, '1 day'::interval) d
;
EXPLAIN SELECT sp.track, sp.artist, avg(h.value) FROM song_play sp left join heartrate h ON h.time <@ sp.play where sp.track='a' and sp.artist='b' GROUP BY sp.track, sp.artist;

Which results in the following:

✓
✓
✓
✓
525889 rows affected
1827 rows affected
| QUERY PLAN |
| :--------------------------------------------------------------------------------------------------------- |
| GroupAggregate (cost=0.28..14689.24 rows=1 width=72) |
| Group Key: sp.track, sp.artist |
| -> Nested Loop Left Join (cost=0.28..14685.28 rows=526 width=72) |
| Join Filter: (h."time" <@ sp.play) |
| -> Index Scan using song_play_track_artist_idx on song_play sp (cost=0.28..8.29 rows=1 width=96) |
| Index Cond: ((track = 'a'::text) AND (artist = 'b'::text)) |
| -> Seq Scan on heartrate h (cost=0.00..8102.55 rows=525955 width=16) |

Note: the above plan results in A Full seq scan of the heartrate table, the largest table - not ideal at all!

I then decided to create the following function to see if it would help speed up queries. It converts a range eg tsrange('2020-01-01 00:00:00', '2020-01-02 00:00:00') to a conditional query eg field >= 2020年01月01日 00:00:00 and field < '2020-01-02 00:00:00' .

essentially the same as the <@ contains operator.

And it seems to work! Although this is only helpful for looking up a particular song_play's heartrate... not all of a track / artist's song_play's heartrates

CREATE OR REPLACE FUNCTION range_to_conditional(range anyrange, field text)
 RETURNS text
 LANGUAGE SQL
 IMMUTABLE STRICT AS
$$
SELECT case
 when isempty(range) then 'false'
 when upper_inf(range) and lower_inf(range) then 'true'
 when upper_inf(range) then case
 when lower_inc(range) then format(' %L <= %I ', lower(range), field)
 else format(' %L < %I ', lower(range), field)
 end
 when lower_inf(range) then case
 when upper_inc(range) then format(' %L >= %I ', upper(range), field)
 else format(' %L > %I ', upper(range), field)
 end
 else
 case
 when lower_inc(range) and upper_inc(range)
 then format(' %1$L <= %3$I AND %2$L >= %3$I ', lower(range), upper(range), field)
 when lower_inc(range)
 then format(' %1$L <= %3$I AND %2$L > %3$I ', lower(range), upper(range), field)
 when upper_inc(range)
 then format(' %1$L < %3$I AND %2$L >= %3$I ', lower(range), upper(range), field)
 else format(' %1$L < %3$I AND %2$L > %3$I ', lower(range), upper(range), field)
 end
 end
$$
;
create function avg_heartrate(sp song_play)
returns double precision as $$
DECLARE
 retval double precision ;
BEGIN
 EXECUTE format('select avg(h.value) from heartrate h where %s', range_to_conditional(sp.play, 'time'))
 INTO STRICT retval;
 RETURN retval;
END
$$
 LANGUAGE plpgsql stable;
SELECT sp.track, sp.artist, sp.play, avg_heartrate(sp) from song_play sp where sp.track='a' and sp.artist='b' limit 10;

✓
✓
track | artist | play | avg_heartrate 
:---- | :----- | :--------------------------------------------------- | :-----------------
a | b | ["2015年01月03日 00:00:00","2015年01月03日 00:03:42.413608") | 78.93074469582096 
a | b | ["2015年01月10日 00:00:00","2015年01月10日 00:01:32.299356") | 83.89127804586359 
a | b | ["2015年01月11日 00:00:00","2015年01月11日 00:03:24.722083") | 62.333722293527885
a | b | ["2015年01月19日 00:00:00","2015年01月19日 00:01:14.845757") | 77.65872734128969 
a | b | ["2015年01月30日 00:00:00","2015年01月30日 00:01:40.991165") | 102.88233680407437
a | b | ["2015年02月06日 00:00:00","2015年02月06日 00:03:51.264716") | 70.34797302970127 
a | b | ["2015年02月13日 00:00:00","2015年02月13日 00:01:23.358657") | 62.91734005187344 
a | b | ["2015年02月25日 00:00:00","2015年02月25日 00:02:04.856602") | 115.45533419257616
a | b | ["2015年02月28日 00:00:00","2015年02月28日 00:02:46.800728") | 117.39846990343175
a | b | ["2015年03月18日 00:00:00","2015年03月18日 00:02:54.893186") | 68.1618921408235

db<>fiddle here

Thanks!

Question 2

Change the join condition from

ON h.time <@ sp.play

to

ON h.time >= lower(sp.play) AND h.time < upper(sp.play)

(if your ranges are open at the right end, else use different inequality operators).

Then the nested loop join can use a regular b-tree index on heartrate(time) to speed up the inner query.

Question 3

Thanks this suits my needs, so I will mark it as accepted. But I was curious what you would do if not all ranges have the same inclusive/exclusive bounds. eg some [) whereas others []

Question 4

That would probably not work, but it is an unusual scenario.

Laurenz Albe Laurenz Albe 62.1k4 gold badges57 silver badges93 bronze badges · Accepted Answer · 2020-06-08 11:36:46Z

1

Change the join condition from

ON h.time <@ sp.play

to

ON h.time >= lower(sp.play) AND h.time < upper(sp.play)

(if your ranges are open at the right end, else use different inequality operators).

Then the nested loop join can use a regular b-tree index on heartrate(time) to speed up the inner query.

Share

Improve this answer

answered Jun 8, 2020 at 11:36

Laurenz Albe's user avatar

Laurenz Albe Laurenz Albe

62.1k4 gold badges57 silver badges93 bronze badges

2

Thanks this suits my needs, so I will mark it as accepted. But I was curious what you would do if not all ranges have the same inclusive/exclusive bounds. eg some [) whereas others []

jamesmstone
– jamesmstone

2020年06月16日 23:22:27 +00:00
Commented Jun 16, 2020 at 23:22
1

That would probably not work, but it is an unusual scenario.

Laurenz Albe
– Laurenz Albe

2020年06月17日 05:29:18 +00:00
Commented Jun 17, 2020 at 5:29

Add a comment |

Stack Exchange Network

Time series index range contains operator (reverse)

Schema

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Time series index range contains operator (reverse)

Schema

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions