How can I get a range of values from sparse data

Question 1

I have the following table (simplified):

create table tunnels
(
 interval varchar not null,
 ts_start timestamp not null,
 ts_end timestamp,
 price_top double precision not null,
 price_bottom double precision not null,
)

Those represent segments, delimited by ts_start, ts_end and a range price_top, price_bottom.

The data is sparse and rows are created only when something changes. There are several timeframes running at the same time, represented by 'interval'

I would like to make a query where I request a time range and get an array where I get the values for each interval existing.

Some sample data is here: https://pastebin.com/P1mugdtk

What I'm looking for is some array with interpolated values by the minute by interval (they're represented as strings: M1, M5, etc. right now), with rows like:

[ ts: xxx, m1_price_top: xx, m1_price_bottom: xx, m5: price_top: xx, m5: price_bottom: xx, ...]
[ ts: xxx, m1_price_top: xx, m1_price_bottom: xx, m5: price_top: xx, m5: price_bottom: xx, ...]

to add a bit more complexity, within a time range, we need the previous record, for each interval, to know where to start and ts_end can be null if there is no data that closes that segment (it's always the last record).

how can I achieve that in postgres?

Question 2

Your pastebin data is not runnable as a script - please put in the necessary brackets and quotes!

Question 3

@Vérace I've made an sql insert file: pastebin.com/weLFKzTn

Question 4

where I get the values for each interval existing. - I presume you mean "interval overlapping"?

Question 5

@Vérace, sorry it wasn't clear: the intervals are various strings (M5, M15, etc) and I'm looking for an array of arrays where for each interval, there is an array that's interpolated by the minute.

Question 6

I'm sorry, but I don't understand what you mean by interpolated by the minute - please take a small subset from the fiddle here and show me what you mean - before and after...

Question 7

This seems to me to be an excellent use case for PostgreSQL's range data type(s)!

I've created your table with a few changes as follows (all of the code below is available on the fiddle here):

CREATE TABLE tunnels
(
 t_interval TEXT NOT NULL, -- TEXT is better virtually **always**
 ts_start TIMESTAMPTZ NOT NULL, -- With timezone better if possible!
 ts_end TIMESTAMPTZ,
 price_top REAL NOT NULL, -- REAL should be good enough - adjust as you
 price_bottom REAL NOT NULL, -- see fit!
 CONSTRAINT ts_lt_te CHECK (ts_start < ts_end),
 CONSTRAINT ti_ts_te_uq UNIQUE (t_interval, ts_start, ts_end)
);

A few points to note:

the TEXT type is almost always preferable to VARCHAR (first field) - see here, here and here - (these posters are PostgreSQL experts!) - I also changed its name because INTERVAL is a PostgreSQL keyword. SQL keywords should never be used as field or table names, it makes debugging difficult and makes your app non-portable
again, the TIMESTAMPTZ is virtually always preferable to TIMESTAMP - see here, here and here
I've used REAL instead of DOUBLE PRECISION - this just depends on how much precision you actually need - the former takes up 4 bytes, the latter 8. See here for details. Basically, you should only use as much precision as you require, but I'm not aware of your requirements in this area, so this is just a suggestion - à vous le choix!
I've added a couple of constraints which would help to weed out bad data - there's no bad data in your sample (however, see below)
I've hidden the table insert statement snippet in the fiddle (pain to scroll through all the time - it's just below the table definition)

So, now we come to PostgreSQL's range types (Functions and Opertors (manual), also see here and here for a couple of nice use cases). PostgreSQL is the only mainstream RDBMS which has these out of the box.

We will use the TSTZRANGE (range of times with TIMESTAMPTZ) type as follows:

SELECT 
 t_interval AS t_int, 
 TSTZRANGE(ts_start, ts_end) AS range,
 price_bottom AS pb, price_top AS pt,
 ts_start AS ts, ts_end AS te
FROM 
 tunnels
WHERE ts_start >= '2023-10-02' AND ts_end < '2023-10-04'
 AND 
 TSTZRANGE(ts_start, ts_end) 
 @> 
 '2023-10-03 19:20:00'::TIMESTAMPTZ 
ORDER BY 
 t_interval, ts_start
LIMIT 20;

Result:

t_int range pb pt ts te
 H1 ["2023年10月03日 18:00:00+00","2023年10月03日 21:00:00+00") 27212.3 27479.5 2023年10月03日 18:00:00+00 2023年10月03日 21:00:00+00
 H2 ["2023年10月03日 06:00:00+00","2023年10月03日 22:00:00+00") 27255 27633.3 2023年10月03日 06:00:00+00 2023年10月03日 22:00:00+00
 M15 ["2023年10月03日 19:15:00+00","2023年10月03日 19:30:00+00") 27230 27479.5 2023年10月03日 19:15:00+00 2023年10月03日 19:30:00+00
 M30 ["2023年10月03日 18:00:00+00","2023年10月03日 20:30:00+00") 27212.3 27479.5 2023年10月03日 18:00:00+00 2023年10月03日 20:30:00+00
 M5 ["2023年10月03日 19:05:00+00","2023年10月03日 19:25:00+00") 27230 27479.5 2023年10月03日 19:05:00+00 2023年10月03日 19:25:00+00

(I've put another couple of examples at the bottom of the fiddle.)

Check out the @> notation - it means "Containment" - you can check out the other operators from the manual.
Note the bounds of the range - starts with [ and ends with ) - the [ is inclusive whereas the ) is exclusive. This is very important!

In the text form of a range, an inclusive lower bound is represented by "[" while an exclusive lower bound is represented by "(". Likewise, an inclusive upper bound is represented by "]", while an exclusive upper bound is represented by ")". (See Section 8.17.5 for more details).

If you run SELECT * FROM tunnels ORDER BY t_interval, ts_start LIMIT 2; with your sample data, the result is:

t_interval ts_start ts_end price_top price_bottom
 D1 2023年10月03日 00:00:00+00 2023年10月04日 00:00:00+00 28613 25965
 D1 2023年10月04日 00:00:00+00 2023年10月09日 00:00:00+00 28613 27123

Now, you can see where ts_end of the first record matches the ts_start of the second - conceivably you could have two different prices at the same point in time - this is probably not a good idea. I ran a query (see fiddle) and your data doesn't appear to suffer from this, but it could.

Now, notice the range colums - they start with [ and end with ) - so no overlaps! For any instant, there can only be one value of price_bottom, price_top - see my question in the fiddle as to why values match in succeeding records.

Now, we come to the pièce de résistance of PostgreSQL's capabilities in this area! First we'll do an INSERT as follows:

--
-- Works, but overlaps with (see above)
-- D1 2023年10月03日 2023年10月04日 2 2 2023年10月04日 27123 28613
--
INSERT INTO tunnels VALUES ('D1', '2023-10-03 01:00:00+00', '2023-10-04 05:00:00+00', 2, 2);

We don't want overlaps to be possible!

So, we do the following: CREATE EXTENSION btree_gist;. From the documentation, we have:

btree_gist provides GiST index operator classes that implement B-tree equivalent behavior for the data types int2, int4, int8, float4, float8, numeric, timestamp with time zone, timestamp without time zone, time with time zone, time without time zone, date, interval, oid, money, char, varchar, text, bytea, bit, varbit, macaddr, macaddr8, inet, cidr, uuid, bool and all enum types.

Note "text" (and "varchar") - without the extension, these types won't work with GiST indexes - not for btree like indexes anyway - try running the fiddle without the extension.

There is a whole ecosystem of PostgreSQL extensions (another feature unique to PostgreSQL), here are the "contrib" modules that are available - and this is only scratching the surface - see here, here & here. Contrib modules are not part of the core, but held in high esteem and distributed with the core.

After DELETE-ing the offending record, we create an index:

CREATE INDEX ti_ts_te_range_ix 
ON tunnels
USING GIST (t_interval, TSTZRANGE(ts_start, ts_end));

and we alter our table and establish a non-overlapping constraint as follows:

ALTER TABLE tunnels ADD CONSTRAINT ts_te_no_overlap_ct
EXCLUDE USING GIST (t_interval WITH =, TSTZRANGE(ts_start, ts_end) WITH &&);

and then we retry our INSERT - Result:

ERROR: conflicting key value violates exclusion constraint "ts_te_no_overlap_ct"
DETAIL: Key (t_interval, tstzrange(ts_start, ts_end))=(D1, ["2023年10月03日 01:00:00+00","2023年10月04日 05:00:00+00")) 
conflicts with existing key (t_interval, tstzrange(ts_start, ts_end))=(D1, ["2023年10月03日 00:00:00+00","2023年10月04日 00:00:00+00")).

So, I hope this answers your question or at least goes some way towards doing so. You say you want an array of arrays - I'm not sure if that's a good idea - manipulating long strings is not SQL's forté - it's far better at "slicing and dicing" data "atoms" - i.e. no .csv fields for example - see Codd's rules:

Rule 2: The guaranteed access rule:

Each and every datum (atomic value) in a relational data base is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name.

Two final points:

1. PostgreSQL recently introduced the multirange type - that might be of interest to you if you wish to have long strings (a processing nightmare!) - a couple of articles - code before multirange and code after!
1. If you like, another thing you could do is to use a GENERATED column - it can make your SQL more elegant at the price of a bit of disk space - you can do this as follows (see small sample fiddle here):
ALTER TABLE test ADD COLUMN z TSTZRANGE GENERATED ALWAYS AS (TSTZRANGE(x, y)) STORED;

If the answer above doesn't satisfy you, well maybe the multirange approach can be adapted to your requirements - ping me on the comments! Interesting question - I learnt a lot (+1)!

Vérace Vérace 31k9 gold badges73 silver badges86 bronze badges · Answer 1 · 2023-10-26 00:05:03Z

This seems to me to be an excellent use case for PostgreSQL's range data type(s)!

I've created your table with a few changes as follows (all of the code below is available on the fiddle here):

CREATE TABLE tunnels
(
 t_interval TEXT NOT NULL, -- TEXT is better virtually **always**
 ts_start TIMESTAMPTZ NOT NULL, -- With timezone better if possible!
 ts_end TIMESTAMPTZ,
 price_top REAL NOT NULL, -- REAL should be good enough - adjust as you
 price_bottom REAL NOT NULL, -- see fit!
 CONSTRAINT ts_lt_te CHECK (ts_start < ts_end),
 CONSTRAINT ti_ts_te_uq UNIQUE (t_interval, ts_start, ts_end)
);

A few points to note:

the TEXT type is almost always preferable to VARCHAR (first field) - see here, here and here - (these posters are PostgreSQL experts!) - I also changed its name because INTERVAL is a PostgreSQL keyword. SQL keywords should never be used as field or table names, it makes debugging difficult and makes your app non-portable
again, the TIMESTAMPTZ is virtually always preferable to TIMESTAMP - see here, here and here
I've used REAL instead of DOUBLE PRECISION - this just depends on how much precision you actually need - the former takes up 4 bytes, the latter 8. See here for details. Basically, you should only use as much precision as you require, but I'm not aware of your requirements in this area, so this is just a suggestion - à vous le choix!
I've added a couple of constraints which would help to weed out bad data - there's no bad data in your sample (however, see below)
I've hidden the table insert statement snippet in the fiddle (pain to scroll through all the time - it's just below the table definition)

So, now we come to PostgreSQL's range types (Functions and Opertors (manual), also see here and here for a couple of nice use cases). PostgreSQL is the only mainstream RDBMS which has these out of the box.

We will use the TSTZRANGE (range of times with TIMESTAMPTZ) type as follows:

SELECT 
 t_interval AS t_int, 
 TSTZRANGE(ts_start, ts_end) AS range,
 price_bottom AS pb, price_top AS pt,
 ts_start AS ts, ts_end AS te
FROM 
 tunnels
WHERE ts_start >= '2023-10-02' AND ts_end < '2023-10-04'
 AND 
 TSTZRANGE(ts_start, ts_end) 
 @> 
 '2023-10-03 19:20:00'::TIMESTAMPTZ 
ORDER BY 
 t_interval, ts_start
LIMIT 20;

Result:

t_int range pb pt ts te
 H1 ["2023年10月03日 18:00:00+00","2023年10月03日 21:00:00+00") 27212.3 27479.5 2023年10月03日 18:00:00+00 2023年10月03日 21:00:00+00
 H2 ["2023年10月03日 06:00:00+00","2023年10月03日 22:00:00+00") 27255 27633.3 2023年10月03日 06:00:00+00 2023年10月03日 22:00:00+00
 M15 ["2023年10月03日 19:15:00+00","2023年10月03日 19:30:00+00") 27230 27479.5 2023年10月03日 19:15:00+00 2023年10月03日 19:30:00+00
 M30 ["2023年10月03日 18:00:00+00","2023年10月03日 20:30:00+00") 27212.3 27479.5 2023年10月03日 18:00:00+00 2023年10月03日 20:30:00+00
 M5 ["2023年10月03日 19:05:00+00","2023年10月03日 19:25:00+00") 27230 27479.5 2023年10月03日 19:05:00+00 2023年10月03日 19:25:00+00

(I've put another couple of examples at the bottom of the fiddle.)

Check out the @> notation - it means "Containment" - you can check out the other operators from the manual.
Note the bounds of the range - starts with [ and ends with ) - the [ is inclusive whereas the ) is exclusive. This is very important!

In the text form of a range, an inclusive lower bound is represented by "[" while an exclusive lower bound is represented by "(". Likewise, an inclusive upper bound is represented by "]", while an exclusive upper bound is represented by ")". (See Section 8.17.5 for more details).

If you run SELECT * FROM tunnels ORDER BY t_interval, ts_start LIMIT 2; with your sample data, the result is:

t_interval ts_start ts_end price_top price_bottom
 D1 2023年10月03日 00:00:00+00 2023年10月04日 00:00:00+00 28613 25965
 D1 2023年10月04日 00:00:00+00 2023年10月09日 00:00:00+00 28613 27123

Now, you can see where ts_end of the first record matches the ts_start of the second - conceivably you could have two different prices at the same point in time - this is probably not a good idea. I ran a query (see fiddle) and your data doesn't appear to suffer from this, but it could.

Now, notice the range colums - they start with [ and end with ) - so no overlaps! For any instant, there can only be one value of price_bottom, price_top - see my question in the fiddle as to why values match in succeeding records.

Now, we come to the pièce de résistance of PostgreSQL's capabilities in this area! First we'll do an INSERT as follows:

--
-- Works, but overlaps with (see above)
-- D1 2023年10月03日 2023年10月04日 2 2 2023年10月04日 27123 28613
--
INSERT INTO tunnels VALUES ('D1', '2023-10-03 01:00:00+00', '2023-10-04 05:00:00+00', 2, 2);

We don't want overlaps to be possible!

So, we do the following: CREATE EXTENSION btree_gist;. From the documentation, we have:

btree_gist provides GiST index operator classes that implement B-tree equivalent behavior for the data types int2, int4, int8, float4, float8, numeric, timestamp with time zone, timestamp without time zone, time with time zone, time without time zone, date, interval, oid, money, char, varchar, text, bytea, bit, varbit, macaddr, macaddr8, inet, cidr, uuid, bool and all enum types.

Note "text" (and "varchar") - without the extension, these types won't work with GiST indexes - not for btree like indexes anyway - try running the fiddle without the extension.

There is a whole ecosystem of PostgreSQL extensions (another feature unique to PostgreSQL), here are the "contrib" modules that are available - and this is only scratching the surface - see here, here & here. Contrib modules are not part of the core, but held in high esteem and distributed with the core.

After DELETE-ing the offending record, we create an index:

CREATE INDEX ti_ts_te_range_ix 
ON tunnels
USING GIST (t_interval, TSTZRANGE(ts_start, ts_end));

and we alter our table and establish a non-overlapping constraint as follows:

ALTER TABLE tunnels ADD CONSTRAINT ts_te_no_overlap_ct
EXCLUDE USING GIST (t_interval WITH =, TSTZRANGE(ts_start, ts_end) WITH &&);

and then we retry our INSERT - Result:

ERROR: conflicting key value violates exclusion constraint "ts_te_no_overlap_ct"
DETAIL: Key (t_interval, tstzrange(ts_start, ts_end))=(D1, ["2023年10月03日 01:00:00+00","2023年10月04日 05:00:00+00")) 
conflicts with existing key (t_interval, tstzrange(ts_start, ts_end))=(D1, ["2023年10月03日 00:00:00+00","2023年10月04日 00:00:00+00")).

So, I hope this answers your question or at least goes some way towards doing so. You say you want an array of arrays - I'm not sure if that's a good idea - manipulating long strings is not SQL's forté - it's far better at "slicing and dicing" data "atoms" - i.e. no .csv fields for example - see Codd's rules:

Rule 2: The guaranteed access rule:

Each and every datum (atomic value) in a relational data base is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name.

Two final points:

1. PostgreSQL recently introduced the multirange type - that might be of interest to you if you wish to have long strings (a processing nightmare!) - a couple of articles - code before multirange and code after!
1. If you like, another thing you could do is to use a GENERATED column - it can make your SQL more elegant at the price of a bit of disk space - you can do this as follows (see small sample fiddle here):
ALTER TABLE test ADD COLUMN z TSTZRANGE GENERATED ALWAYS AS (TSTZRANGE(x, y)) STORED;

If the answer above doesn't satisfy you, well maybe the multirange approach can be adapted to your requirements - ping me on the comments! Interesting question - I learnt a lot (+1)!

Stack Exchange Network

How can I get a range of values from sparse data

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

How can I get a range of values from sparse data

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions