Computing Matching Data Points from Fuzzy Timestamps in Postgresql

Question 1

I have a table containing multiple time series of different types. The timestamps of cohesive points from different series do not match exactly (i. e. the difference can be up to an hour).

Schema

Below is the schema with two example series:

CREATE TABLE series (id integer, series_type integer, charttime timestamp,
 value integer, PRIMARY KEY (id));
INSERT INTO series VALUES (1, 1, '2018-03-01 12:10:00', 40),
 (2, 1, '2018-03-01 13:25:00', 30), (3, 1, '2018-03-01 14:10:00', 50);
INSERT INTO series VALUES (4, 2, '2018-03-01 11:20:00', 2), (5, 2, '2018-03-01 12:15:00', 6),
 (6, 2, '2018-03-01 13:00:00', 7), (7, 2, '2018-03-01 13:45:00', 1);
id |series_type |charttime |value |
---|------------|--------------------|------|
1 |1 |2018年03月01日 12:10:00 |40 |
2 |1 |2018年03月01日 13:25:00 |30 |
3 |1 |2018年03月01日 14:10:00 |50 |
4 |2 |2018年03月01日 11:20:00 |2 |
5 |2 |2018年03月01日 12:15:00 |6 |
7 |2 |2018年03月01日 13:45:00 |1 |
6 |2 |2018年03月01日 13:00:00 |7 |

Goal

The goal is to select one series together with the closest datapoint from another series. For the example dataset the result should be:

charttime |s1 |s2 |
--------------------|---|---|
2018年03月01日 12:10:00 |40 |6 |
2018年03月01日 13:25:00 |30 |1 |
2018年03月01日 14:10:00 |50 |1 |

First working approach

My current approach is to select the best matching data point from the other series by a subquery:

SELECT l.charttime, l.value AS s1,
 ( SELECT r.value
 FROM series r
 WHERE ABS( EXTRACT( EPOCH FROM l.charttime - r.charttime ) / 3600 ) < 1
 AND r.series_type = 2
 ORDER BY ABS( EXTRACT( EPOCH FROM l.charttime - r.charttime )) ASC LIMIT 1 
 ) AS s2
FROM series l
WHERE l.series_type = 1
ORDER BY l.charttime ASC

This does not seem to be the best approach as the dataset is quite huge and thus performing many subqueries slows down the query.

Second approach

A different idea is to self-join the table and filter for close data timestamps:

SELECT l.charttime, l.value AS s1, r.charttime, r.value AS s2
FROM series l, series r
WHERE abs(EXTRACT(EPOCH FROM l.charttime - r.charttime) / 3600) < 1
 AND l.series_type = 1 AND r.series_type = 2
charttime |s1 |charttime |s2 |
--------------------|---|--------------------|---|
2018年03月01日 12:10:00 |40 |2018年03月01日 11:20:00 |2 |
2018年03月01日 12:10:00 |40 |2018年03月01日 12:15:00 |6 |
2018年03月01日 12:10:00 |40 |2018年03月01日 13:00:00 |7 |
2018年03月01日 13:25:00 |30 |2018年03月01日 13:45:00 |1 |
2018年03月01日 13:25:00 |30 |2018年03月01日 13:00:00 |7 |
2018年03月01日 14:10:00 |50 |2018年03月01日 13:45:00 |1 |

The problem then are the duplicate data points. Grouping in the first column does not work as the best matching s2 cannot be selected.

Is there a better approach?

Question 2

Do you have a set amount of series? You say The goal is to select one series together with the closest datapoint from another series but in your example 40,30,50 all come from series_type=1. I am also confused as to how you have two datasets and three points in your desired output. Are either of these the desired output? Could you show/confirm the desired output given the input so we have a test case?

Question 3

In the displayed output samples, the values in column s2 come from series_type = 2.

Question 4

@EvanCarroll The example result is from the example series. There are three data points in series_type = 1. As pointed out, s2 are the matching points from series_type=2.

Question 5

In your second approach, with the self join, you could remove duplicates using row_number(),

Partition by l.charttime, order by the time difference and filter for row_number = 1.

I think performance will be horrible, however. Because of the cartesian join this will be an O(size(series 1) x size(series 2)) operation.

Having both l.charttime and r.charttime inside functions may also be causing trouble. Try refactoring to (in pseudo code)

 r.charttime < l.charttime + 3600
and r.charttime > l.charttime - 3600

.. and see how the query plan looks. I presume there's an index on charttime. Without one no approach will be fast. Indeed, two partial indexes, one on Series 1 and one on Series 2 may be even better.

Question 6

Your idea with parition by works quite will, it reduced the query time from around 20s to 1s. I have posted an answer with the code I ended up with.

Question 7

I'm reading your business rule as "get each row from series 1 and the closest row from series 2, if any, which must be within 3600 of series 1's value." I'm wondering if a good solution to this wouldn't be a cursor. Well, two cursors.

The basic algorithm is to find the two series 2 values that straddle each series 1 value i.e. the one just before and the one just after in time sequence. Then use whichever of these is closer. It would look something like this:

declare two variables: Smaller(datetime, value) and Larger(datetime, value).
initialize the variables to their domain minimum value e.g. (1900年01月01日 00:00:00, 0).
open a cursor on series 1, in time order
open a cursor on series 2, in time order
while rows remain in Series1
 while Larger.datetime < Series1.datetime
 Read next Series2
 set Smaller = Larger
 set Larger = Series2
 // Add logic for when Series2 is exhausted
 end
 // We know Smaller is less than Series1.datetime and Larger is greater than or equal,
 // or there's a case not documented in the question.
 // Check for Smaller, Larger within the 3600 window to be added.
 if (Series1.datetime - Smaller.datetime) < (Larger.datetime - Series1.datetime)
 use Smaller.value
 else
 use Larger.value
 end
 read next Series1
end

I've omitted a lot of the niceties, obviously. There will be some cases about having multiple series 1 values before the first series 2 value, or after the last series 2 value. Also if there is no match in series 2 for a given series 1 value. Your description does not mention the rules for these but I'm sure you'll be able to work them in.

This requires both series' values to be in time sequence. The sort could be expensive. However, there would have to be an index on this column for any solution to be workable. So the query would have a time-ordered access path into the data already, and, likely, there would be no actual sort at runtime.

The time complexity of this is O(size(series 1) + size(series 2)) i.e. O(N) rather than O(N^2) of the self-cross-join.

Question 8

Based on the idea from Michael Green I ended up with the following:

WITH c AS (
SELECT
 l.charttime,
 l.value AS s1,
 r.value AS s2,
 rank() OVER (PARTITION BY l.charttime ORDER BY abs(EXTRACT(EPOCH FROM l.charttime - r.charttime) / 3600) ASC) AS rnk
FROM
 series l, series r
WHERE
 abs(EXTRACT(EPOCH FROM l.charttime - r.charttime) / 3600) < 1
 AND l.series_type = 1
 AND r.series_type = 2
)
SELECT charttime, s1, s2 FROM c WHERE rnk = 1 ORDER BY charttime

The query time is around 1s compared to 20s in my original approach.

score 4 · Accepted Answer · 2018-07-04 20:04:49Z

In your second approach, with the self join, you could remove duplicates using row_number(),

Partition by l.charttime, order by the time difference and filter for row_number = 1.

I think performance will be horrible, however. Because of the cartesian join this will be an O(size(series 1) x size(series 2)) operation.

Having both l.charttime and r.charttime inside functions may also be causing trouble. Try refactoring to (in pseudo code)

 r.charttime < l.charttime + 3600
and r.charttime > l.charttime - 3600

.. and see how the query plan looks. I presume there's an index on charttime. Without one no approach will be fast. Indeed, two partial indexes, one on Series 1 and one on Series 2 may be even better.

Your idea with parition by works quite will, it reduced the query time from around 20s to 1s. I have posted an answer with the code I ended up with.

Stack Exchange Network

Computing Matching Data Points from Fuzzy Timestamps in Postgresql

Schema

Goal

First working approach

Second approach

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Computing Matching Data Points from Fuzzy Timestamps in Postgresql

Schema

Goal

First working approach

Second approach

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions