I have a table containing multiple time series of different types. The timestamps of cohesive points from different series do not match exactly (i. e. the difference can be up to an hour).
Schema
Below is the schema with two example series:
CREATE TABLE series (id integer, series_type integer, charttime timestamp,
value integer, PRIMARY KEY (id));
INSERT INTO series VALUES (1, 1, '2018-03-01 12:10:00', 40),
(2, 1, '2018-03-01 13:25:00', 30), (3, 1, '2018-03-01 14:10:00', 50);
INSERT INTO series VALUES (4, 2, '2018-03-01 11:20:00', 2), (5, 2, '2018-03-01 12:15:00', 6),
(6, 2, '2018-03-01 13:00:00', 7), (7, 2, '2018-03-01 13:45:00', 1);
id |series_type |charttime |value |
---|------------|--------------------|------|
1 |1 |2018年03月01日 12:10:00 |40 |
2 |1 |2018年03月01日 13:25:00 |30 |
3 |1 |2018年03月01日 14:10:00 |50 |
4 |2 |2018年03月01日 11:20:00 |2 |
5 |2 |2018年03月01日 12:15:00 |6 |
7 |2 |2018年03月01日 13:45:00 |1 |
6 |2 |2018年03月01日 13:00:00 |7 |
Goal
The goal is to select one series together with the closest datapoint from another series. For the example dataset the result should be:
charttime |s1 |s2 |
--------------------|---|---|
2018年03月01日 12:10:00 |40 |6 |
2018年03月01日 13:25:00 |30 |1 |
2018年03月01日 14:10:00 |50 |1 |
First working approach
My current approach is to select the best matching data point from the other series by a subquery:
SELECT l.charttime, l.value AS s1,
( SELECT r.value
FROM series r
WHERE ABS( EXTRACT( EPOCH FROM l.charttime - r.charttime ) / 3600 ) < 1
AND r.series_type = 2
ORDER BY ABS( EXTRACT( EPOCH FROM l.charttime - r.charttime )) ASC LIMIT 1
) AS s2
FROM series l
WHERE l.series_type = 1
ORDER BY l.charttime ASC
This does not seem to be the best approach as the dataset is quite huge and thus performing many subqueries slows down the query.
Second approach
A different idea is to self-join the table and filter for close data timestamps:
SELECT l.charttime, l.value AS s1, r.charttime, r.value AS s2
FROM series l, series r
WHERE abs(EXTRACT(EPOCH FROM l.charttime - r.charttime) / 3600) < 1
AND l.series_type = 1 AND r.series_type = 2
charttime |s1 |charttime |s2 |
--------------------|---|--------------------|---|
2018年03月01日 12:10:00 |40 |2018年03月01日 11:20:00 |2 |
2018年03月01日 12:10:00 |40 |2018年03月01日 12:15:00 |6 |
2018年03月01日 12:10:00 |40 |2018年03月01日 13:00:00 |7 |
2018年03月01日 13:25:00 |30 |2018年03月01日 13:45:00 |1 |
2018年03月01日 13:25:00 |30 |2018年03月01日 13:00:00 |7 |
2018年03月01日 14:10:00 |50 |2018年03月01日 13:45:00 |1 |
The problem then are the duplicate data points. Grouping in the first column does not work as the best matching s2
cannot be selected.
Is there a better approach?
3 Answers 3
In your second approach, with the self join, you could remove duplicates using row_number()
,
Partition by l.charttime, order by the time difference and filter for row_number = 1.
I think performance will be horrible, however. Because of the cartesian join this will be an O(size(series 1) x size(series 2)) operation.
Having both l.charttime and r.charttime inside functions may also be causing trouble. Try refactoring to (in pseudo code)
r.charttime < l.charttime + 3600
and r.charttime > l.charttime - 3600
.. and see how the query plan looks. I presume there's an index on charttime. Without one no approach will be fast. Indeed, two partial indexes, one on Series 1 and one on Series 2 may be even better.
-
Your idea with parition by works quite will, it reduced the query time from around 20s to 1s. I have posted an answer with the code I ended up with.stsc– stsc2018年07月10日 11:11:46 +00:00Commented Jul 10, 2018 at 11:11
I'm reading your business rule as "get each row from series 1 and the closest row from series 2, if any, which must be within 3600 of series 1's value." I'm wondering if a good solution to this wouldn't be a cursor. Well, two cursors.
The basic algorithm is to find the two series 2 values that straddle each series 1 value i.e. the one just before and the one just after in time sequence. Then use whichever of these is closer. It would look something like this:
declare two variables: Smaller(datetime, value) and Larger(datetime, value).
initialize the variables to their domain minimum value e.g. (1900年01月01日 00:00:00, 0).
open a cursor on series 1, in time order
open a cursor on series 2, in time order
while rows remain in Series1
while Larger.datetime < Series1.datetime
Read next Series2
set Smaller = Larger
set Larger = Series2
// Add logic for when Series2 is exhausted
end
// We know Smaller is less than Series1.datetime and Larger is greater than or equal,
// or there's a case not documented in the question.
// Check for Smaller, Larger within the 3600 window to be added.
if (Series1.datetime - Smaller.datetime) < (Larger.datetime - Series1.datetime)
use Smaller.value
else
use Larger.value
end
read next Series1
end
I've omitted a lot of the niceties, obviously. There will be some cases about having multiple series 1 values before the first series 2 value, or after the last series 2 value. Also if there is no match in series 2 for a given series 1 value. Your description does not mention the rules for these but I'm sure you'll be able to work them in.
This requires both series' values to be in time sequence. The sort could be expensive. However, there would have to be an index on this column for any solution to be workable. So the query would have a time-ordered access path into the data already, and, likely, there would be no actual sort at runtime.
The time complexity of this is O(size(series 1) + size(series 2)) i.e. O(N) rather than O(N^2) of the self-cross-join.
Based on the idea from Michael Green I ended up with the following:
WITH c AS (
SELECT
l.charttime,
l.value AS s1,
r.value AS s2,
rank() OVER (PARTITION BY l.charttime ORDER BY abs(EXTRACT(EPOCH FROM l.charttime - r.charttime) / 3600) ASC) AS rnk
FROM
series l, series r
WHERE
abs(EXTRACT(EPOCH FROM l.charttime - r.charttime) / 3600) < 1
AND l.series_type = 1
AND r.series_type = 2
)
SELECT charttime, s1, s2 FROM c WHERE rnk = 1 ORDER BY charttime
The query time is around 1s compared to 20s in my original approach.
s2
come fromseries_type
= 2.series_type = 1
. As pointed out,s2
are the matching points fromseries_type=2
.