I have the following table (simplified):
create table tunnels
(
interval varchar not null,
ts_start timestamp not null,
ts_end timestamp,
price_top double precision not null,
price_bottom double precision not null,
)
Those represent segments, delimited by ts_start, ts_end and a range price_top, price_bottom.
The data is sparse and rows are created only when something changes. There are several timeframes running at the same time, represented by 'interval'
I would like to make a query where I request a time range and get an array where I get the values for each interval existing.
Some sample data is here: https://pastebin.com/P1mugdtk
What I'm looking for is some array with interpolated values by the minute by interval (they're represented as strings: M1, M5, etc. right now), with rows like:
[ ts: xxx, m1_price_top: xx, m1_price_bottom: xx, m5: price_top: xx, m5: price_bottom: xx, ...]
[ ts: xxx, m1_price_top: xx, m1_price_bottom: xx, m5: price_top: xx, m5: price_bottom: xx, ...]
to add a bit more complexity, within a time range, we need the previous record, for each interval, to know where to start and ts_end can be null if there is no data that closes that segment (it's always the last record).
how can I achieve that in postgres?
1 Answer 1
This seems to me to be an excellent use case for PostgreSQL's range data type(s)!
I've created your table with a few changes as follows (all of the code below is available on the fiddle here):
CREATE TABLE tunnels
(
t_interval TEXT NOT NULL, -- TEXT is better virtually **always**
ts_start TIMESTAMPTZ NOT NULL, -- With timezone better if possible!
ts_end TIMESTAMPTZ,
price_top REAL NOT NULL, -- REAL should be good enough - adjust as you
price_bottom REAL NOT NULL, -- see fit!
CONSTRAINT ts_lt_te CHECK (ts_start < ts_end),
CONSTRAINT ti_ts_te_uq UNIQUE (t_interval, ts_start, ts_end)
);
A few points to note:
the
TEXT
type is almost always preferable toVARCHAR
(first field) - seehere
,here
andhere
- (these posters are PostgreSQL experts!) - I also changed its name becauseINTERVAL
is a PostgreSQL keyword.SQL keywords
should never be used as field or table names, it makes debugging difficult and makes your app non-portableagain, the
TIMESTAMPTZ
is virtually always preferable toTIMESTAMP
- seehere
,here
andhere
I've used
REAL
instead ofDOUBLE PRECISION
- this just depends on how much precision you actually need - the former takes up 4 bytes, the latter 8. Seehere
for details. Basically, you should only use as much precision as you require, but I'm not aware of your requirements in this area, so this is just a suggestion - à vous le choix!I've added a couple of constraints which would help to weed out bad data - there's no bad data in your sample (however, see below)
I've hidden the table insert statement snippet in the fiddle (pain to scroll through all the time - it's just below the table definition)
So, now we come to PostgreSQL's range types (Functions and Opertors (manual
), also see here
and here
for a couple of nice use cases). PostgreSQL is the only mainstream RDBMS which has these out of the box.
We will use the TSTZRANGE
(range of times with TIMESTAMPTZ
) type as follows:
SELECT
t_interval AS t_int,
TSTZRANGE(ts_start, ts_end) AS range,
price_bottom AS pb, price_top AS pt,
ts_start AS ts, ts_end AS te
FROM
tunnels
WHERE ts_start >= '2023-10-02' AND ts_end < '2023-10-04'
AND
TSTZRANGE(ts_start, ts_end)
@>
'2023-10-03 19:20:00'::TIMESTAMPTZ
ORDER BY
t_interval, ts_start
LIMIT 20;
Result:
t_int range pb pt ts te
H1 ["2023年10月03日 18:00:00+00","2023年10月03日 21:00:00+00") 27212.3 27479.5 2023年10月03日 18:00:00+00 2023年10月03日 21:00:00+00
H2 ["2023年10月03日 06:00:00+00","2023年10月03日 22:00:00+00") 27255 27633.3 2023年10月03日 06:00:00+00 2023年10月03日 22:00:00+00
M15 ["2023年10月03日 19:15:00+00","2023年10月03日 19:30:00+00") 27230 27479.5 2023年10月03日 19:15:00+00 2023年10月03日 19:30:00+00
M30 ["2023年10月03日 18:00:00+00","2023年10月03日 20:30:00+00") 27212.3 27479.5 2023年10月03日 18:00:00+00 2023年10月03日 20:30:00+00
M5 ["2023年10月03日 19:05:00+00","2023年10月03日 19:25:00+00") 27230 27479.5 2023年10月03日 19:05:00+00 2023年10月03日 19:25:00+00
(I've put another couple of examples at the bottom of the fiddle.)
Check out the
@>
notation - it means "Containment" - you can check out the other operators from the manual.Note the
bounds
of the range - starts with[
and ends with)
- the[
is inclusive whereas the)
is exclusive. This is very important!
In the text form of a range, an inclusive lower bound is represented by "[" while an exclusive lower bound is represented by "(". Likewise, an inclusive upper bound is represented by "]", while an exclusive upper bound is represented by ")". (See Section 8.17.5 for more details).
If you run SELECT * FROM tunnels ORDER BY t_interval, ts_start LIMIT 2;
with your sample data, the result is:
t_interval ts_start ts_end price_top price_bottom
D1 2023年10月03日 00:00:00+00 2023年10月04日 00:00:00+00 28613 25965
D1 2023年10月04日 00:00:00+00 2023年10月09日 00:00:00+00 28613 27123
Now, you can see where ts_end
of the first record matches the ts_start
of the second - conceivably you could have two different prices at the same point in time - this is probably not a good idea. I ran a query (see fiddle) and your data doesn't appear to suffer from this, but it could.
Now, notice the range colums - they start with [
and end with )
- so no overlaps! For any instant, there can only be one value of price_bottom
, price_top
- see my question in the fiddle as to why values match in succeeding records.
Now, we come to the pièce de résistance of PostgreSQL's capabilities in this area! First we'll do an INSERT
as follows:
--
-- Works, but overlaps with (see above)
-- D1 2023年10月03日 2023年10月04日 2 2 2023年10月04日 27123 28613
--
INSERT INTO tunnels VALUES ('D1', '2023-10-03 01:00:00+00', '2023-10-04 05:00:00+00', 2, 2);
We don't want overlaps to be possible!
So, we do the following: CREATE EXTENSION btree_gist;
. From the documentation, we have:
btree_gist provides GiST index operator classes that implement B-tree equivalent behavior for the data types int2, int4, int8, float4, float8, numeric, timestamp with time zone, timestamp without time zone, time with time zone, time without time zone, date, interval, oid, money, char, varchar, text, bytea, bit, varbit, macaddr, macaddr8, inet, cidr, uuid, bool and all enum types.
Note "text"
(and "varchar"
) - without the extension, these types won't work with GiST indexes
- not for btree like indexes anyway - try running the fiddle without the extension.
There is a whole ecosystem of PostgreSQL extensions (another feature unique to PostgreSQL), here
are the "contrib" modules that are available - and this is only scratching the surface - see here, here & here. Contrib modules are not part of the core, but held in high esteem and distributed with the core.
After DELETE
-ing the offending record, we create an index:
CREATE INDEX ti_ts_te_range_ix
ON tunnels
USING GIST (t_interval, TSTZRANGE(ts_start, ts_end));
and we alter our table and establish a non-overlapping constraint as follows:
ALTER TABLE tunnels ADD CONSTRAINT ts_te_no_overlap_ct
EXCLUDE USING GIST (t_interval WITH =, TSTZRANGE(ts_start, ts_end) WITH &&);
and then we retry our INSERT
- Result:
ERROR: conflicting key value violates exclusion constraint "ts_te_no_overlap_ct"
DETAIL: Key (t_interval, tstzrange(ts_start, ts_end))=(D1, ["2023年10月03日 01:00:00+00","2023年10月04日 05:00:00+00"))
conflicts with existing key (t_interval, tstzrange(ts_start, ts_end))=(D1, ["2023年10月03日 00:00:00+00","2023年10月04日 00:00:00+00")).
So, I hope this answers your question or at least goes some way towards doing so. You say you want an array of arrays
- I'm not sure if that's a good idea - manipulating long strings is not SQL's forté - it's far better at "slicing and dicing" data "atoms" - i.e. no .csv fields for example - see Codd's rules:
Rule 2: The guaranteed access rule:
Each and every datum (atomic value) in a relational data base is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name.
Two final points:
-
- If you like, another thing you could do is to use a
GENERATED
column - it can make your SQL more elegant at the price of a bit of disk space - you can do this as follows (see small sample fiddle here):
ALTER TABLE test ADD COLUMN z TSTZRANGE GENERATED ALWAYS AS (TSTZRANGE(x, y)) STORED;
- If you like, another thing you could do is to use a
If the answer above doesn't satisfy you, well maybe the multirange approach can be adapted to your requirements - ping me on the comments! Interesting question - I learnt a lot (+1)!
where I get the values for each interval existing.
- I presume you mean "interval overlapping"?interpolated by the minute
- please take a small subset from the fiddle here and show me what you mean - before and after...