2

Hi I am trying to optimise the a timestamp range contains <@ query for Postgres 12

I have done a bit of reading postgres documentation and discovered only GiST and SP-GiST indexes support this operator. However, I can't add one of these ( I tink I would need to add one to the heartrate table - see Schema below, but that is not a range type...).

My question is similar to this question and this one which also indicate I will need a GiST index. However, they are the other way around, eg the column they have a single timestamp and want to return from a table of tsranges all records that are contained. I have a table of timestamps and want to join it to a table of tsranges

For a bit of context for my schema, I have a collection of heartrates in the real dataset sampled ~1/3seconds, and a list of songs I have listen to, and when. I would like to query things like

  • avg(heartrate) for a particular track and artist
  • avg(heartrate) for a particular artist
  • etc.

Schema

create table heartrate (
 "time" timestamp primary key ,
 value float
)
;
CREATE INDEX ON heartrate ("time", value);
-- CREATE INDEX ON heartrate USING GIST ("time", value); can't do as "time" is not a range column.
-- one gets the following error: 
--- ERROR: data type timestamp without time zone has no default operator class for access method "gist" Hint: You must specify an operator class for the index or define a default operator class for the data type.
create table song_play(
 track TEXT NOT NULL,
 artist TEXT NOT NULL,
 play tsrange not null
)
;
CREATE INDEX ON song_play(track, artist);
INSERT INTO heartrate("time", value)
SELECT d, 60+60*random()
FROM generate_series('2015-01-01 00:00:00'::timestamp, '2020-01-01 00:00:00'::timestamp, '5 min'::interval) d
;
INSERT INTO song_play(track,artist, play)
SELECT case when random() > 0.5 then 'a' when random() > 0.5 then 'b' else 'c' end 
, case when random() > 0.5 then 'a' when random() > 0.5 then 'b' else 'c' end
, tsrange(d, d+ (((random()*3+1)::text|| 'min')::interval))
FROM generate_series('2015-01-01 00:00:00'::timestamp, '2020-01-01 00:00:00'::timestamp, '1 day'::interval) d
;
EXPLAIN SELECT sp.track, sp.artist, avg(h.value) FROM song_play sp left join heartrate h ON h.time <@ sp.play where sp.track='a' and sp.artist='b' GROUP BY sp.track, sp.artist;

Which results in the following:

✓
✓
✓
✓
525889 rows affected
1827 rows affected
| QUERY PLAN |
| :--------------------------------------------------------------------------------------------------------- |
| GroupAggregate (cost=0.28..14689.24 rows=1 width=72) |
| Group Key: sp.track, sp.artist |
| -> Nested Loop Left Join (cost=0.28..14685.28 rows=526 width=72) |
| Join Filter: (h."time" <@ sp.play) |
| -> Index Scan using song_play_track_artist_idx on song_play sp (cost=0.28..8.29 rows=1 width=96) |
| Index Cond: ((track = 'a'::text) AND (artist = 'b'::text)) |
| -> Seq Scan on heartrate h (cost=0.00..8102.55 rows=525955 width=16) |

Note: the above plan results in A Full seq scan of the heartrate table, the largest table - not ideal at all!

I then decided to create the following function to see if it would help speed up queries. It converts a range eg tsrange('2020-01-01 00:00:00', '2020-01-02 00:00:00') to a conditional query eg field >= 2020年01月01日 00:00:00 and field < '2020-01-02 00:00:00' .

essentially the same as the <@ contains operator.

And it seems to work! Although this is only helpful for looking up a particular song_play's heartrate... not all of a track / artist's song_play's heartrates

CREATE OR REPLACE FUNCTION range_to_conditional(range anyrange, field text)
 RETURNS text
 LANGUAGE SQL
 IMMUTABLE STRICT AS
$$
SELECT case
 when isempty(range) then 'false'
 when upper_inf(range) and lower_inf(range) then 'true'
 when upper_inf(range) then case
 when lower_inc(range) then format(' %L <= %I ', lower(range), field)
 else format(' %L < %I ', lower(range), field)
 end
 when lower_inf(range) then case
 when upper_inc(range) then format(' %L >= %I ', upper(range), field)
 else format(' %L > %I ', upper(range), field)
 end
 else
 case
 when lower_inc(range) and upper_inc(range)
 then format(' %1$L <= %3$I AND %2$L >= %3$I ', lower(range), upper(range), field)
 when lower_inc(range)
 then format(' %1$L <= %3$I AND %2$L > %3$I ', lower(range), upper(range), field)
 when upper_inc(range)
 then format(' %1$L < %3$I AND %2$L >= %3$I ', lower(range), upper(range), field)
 else format(' %1$L < %3$I AND %2$L > %3$I ', lower(range), upper(range), field)
 end
 end
$$
;
create function avg_heartrate(sp song_play)
returns double precision as $$
DECLARE
 retval double precision ;
BEGIN
 EXECUTE format('select avg(h.value) from heartrate h where %s', range_to_conditional(sp.play, 'time'))
 INTO STRICT retval;
 RETURN retval;
END
$$
 LANGUAGE plpgsql stable;
SELECT sp.track, sp.artist, sp.play, avg_heartrate(sp) from song_play sp where sp.track='a' and sp.artist='b' limit 10;
✓
✓
track | artist | play | avg_heartrate 
:---- | :----- | :--------------------------------------------------- | :-----------------
a | b | ["2015年01月03日 00:00:00","2015年01月03日 00:03:42.413608") | 78.93074469582096 
a | b | ["2015年01月10日 00:00:00","2015年01月10日 00:01:32.299356") | 83.89127804586359 
a | b | ["2015年01月11日 00:00:00","2015年01月11日 00:03:24.722083") | 62.333722293527885
a | b | ["2015年01月19日 00:00:00","2015年01月19日 00:01:14.845757") | 77.65872734128969 
a | b | ["2015年01月30日 00:00:00","2015年01月30日 00:01:40.991165") | 102.88233680407437
a | b | ["2015年02月06日 00:00:00","2015年02月06日 00:03:51.264716") | 70.34797302970127 
a | b | ["2015年02月13日 00:00:00","2015年02月13日 00:01:23.358657") | 62.91734005187344 
a | b | ["2015年02月25日 00:00:00","2015年02月25日 00:02:04.856602") | 115.45533419257616
a | b | ["2015年02月28日 00:00:00","2015年02月28日 00:02:46.800728") | 117.39846990343175
a | b | ["2015年03月18日 00:00:00","2015年03月18日 00:02:54.893186") | 68.1618921408235 

db<>fiddle here

Thanks!

asked Jun 8, 2020 at 2:57

1 Answer 1

1

Change the join condition from

ON h.time <@ sp.play

to

ON h.time >= lower(sp.play) AND h.time < upper(sp.play)

(if your ranges are open at the right end, else use different inequality operators).

Then the nested loop join can use a regular b-tree index on heartrate(time) to speed up the inner query.

answered Jun 8, 2020 at 11:36
2
  • Thanks this suits my needs, so I will mark it as accepted. But I was curious what you would do if not all ranges have the same inclusive/exclusive bounds. eg some [) whereas others [] Commented Jun 16, 2020 at 23:22
  • 1
    That would probably not work, but it is an unusual scenario. Commented Jun 17, 2020 at 5:29

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.