I have a table that tracks state changes of an entity with timestamps. The schema of the table is as follows:
entityid | userid | originalvalue | newvalue | changetime |
---|---|---|---|---|
255 | 101 | Draft | Submitted | 2023年11月28日 12:10:56.389111 |
255 | 102 | Submitted | Received | 2023年12月04日 10:30:02.776323 |
255 | 103 | Received | Audited | 2024年01月07日 09:50:21.700231 |
255 | 104 | Audited | Paid | 2024年01月18日 13:19:06.805542 |
255 | 105 | Paid | In Progress | 2024年02月04日 12:10:58.670298 |
255 | 105 | In Progress | Require Company Response | 2024年02月04日 12:44:03.201730 |
255 | 105 | Require Company Response | In Progress | 2024年05月19日 11:37:08.419742 |
255 | 106 | In Progress | Ready for primary registration | 2024年05月19日 11:37:53.641486 |
255 | 107 | Ready for primary registration | In-session | 2024年05月20日 10:37:48.640948 |
255 | 107 | In-session | Ready for primary registration | 2024年05月20日 11:45:44.560424 |
255 | 107 | Ready for primary registration | In-session | 2024年05月21日 10:06:47.653381 |
255 | 107 | In-session | Primary Registered | 2024年06月06日 15:29:22.813432 |
255 | 108 | Primary Registered | Require Company Response | 2024年06月09日 14:21:12.664475 |
255 | 102 | Require Company Response | In Progress | 2024年06月24日 08:28:55.994459 |
I want to compare the time it takes for the originalvalue
to transition to the newvalue
. For example, if the originalvalue
is "Draft" and it becomes "Submitted", I need to find the corresponding newvalue
in the next row and calculate the time between the originalvalue
and newvalue
changes.
To achieve this, I joined the table with itself on entityid
where t1.newvalue = t2.originalvalue
and t1.changetime <= t2.changetime
:
SELECT t1.entityid,
t1.userid,
t1.changetime as startStateTime
t1.originalvalue as StartingState,
t2.originalvalue as NextState,
t2.changetime as nextStateTime
FROM table1 t1
INNER JOIN table1 t2
ON t1.entityid = t2.entityid
AND t1.newvalue = t2.originalvalue
and t1.changetime <= t2.changetime
However, this query sometimes results in duplicate rows where the same state transition appears multiple times with different timestamps. For example, the state "In Progress" appears multiple times, causing duplicates.
entityid | userid | StartStateTime | StartingState | NextState | NextStateTime |
---|---|---|---|---|---|
255 | 101 | 2023年11月28日 12:10:56.389111 | "Draft" | "Submitted" | 2023年12月04日 10:30:02.776323 |
255 | 102 | 2023年12月04日 10:30:02.776323 | "Submitted" | "Received" | 2024年01月07日 09:50:21.700231 |
255 | 103 | 2024年01月07日 09:50:21.700231 | "Received" | "Audited" | 2024年01月18日 13:19:06.805542 |
255 | 104 | 2024年01月18日 13:19:06.805542 | "Audited" | "Paid" | 2024年02月04日 12:10:58.670298 |
255 | 105 | 2024年02月04日 12:10:58.670298 | "Paid" | "In Progress" | 2024年02月04日 12:44:03.201730 |
255 | 105 | 2024年02月04日 12:44:03.201730 | "In Progress" | "Require Company Response" | 2024年05月19日 11:37:08.419742 |
255 | 105 | 2024年02月04日 12:10:58.670298 | "Paid" | "In Progress" | 2024年05月19日 11:37:53.641486 |
255 | 105 | 2024年05月19日 11:37:08.419742 | "Require Company Response" | "In Progress" | 2024年05月19日 11:37:53.641486 |
255 | 106 | 2024年05月19日 11:37:53.641486 | "In Progress" | "Ready for primary registration" | 2024年05月20日 10:37:48.640948 |
255 | 107 | 2024年05月20日 10:37:48.640948 | "Ready for primary registration" | "In-session" | 2024年05月20日 11:45:44.560424 |
255 | 106 | 2024年05月19日 11:37:53.641486 | "In Progress" | "Ready for primary registration" | 2024年05月21日 10:06:47.653381 |
255 | 107 | 2024年05月20日 11:45:44.560424 | "In-session" | "Ready for primary registration" | 2024年05月21日 10:06:47.653381 |
255 | 107 | 2024年05月20日 10:37:48.640948 | "Ready for primary registration" | "In-session" | 2024年06月06日 15:29:22.813432 |
255 | 107 | 2024年05月21日 10:06:47.653381 | "Ready for primary registration" | "In-session" | 2024年06月06日 15:29:22.813432 |
255 | 107 | 2024年06月06日 15:29:22.813432 | "In-session" | "Primary Registered" | 2024年06月09日 14:21:12.664475 |
255 | 105 | 2024年02月04日 12:44:03.201730 | "In Progress" | "Require Company Response" | 2024年06月24日 08:28:55.994459 |
255 | 108 | 2024年06月09日 14:21:12.664475 | "Primary Registered" | "Require Company Response" | 2024年06月24日 08:28:55.994459 |
To filter these duplicates, I used the ROW_NUMBER()
window function:
what i did it user row number
WITH NumberedStates AS (
SELECT t1.entityid,
t1.userid,
t1.changetime AS startStateTime,
t1.originalvalue AS StartingState,
t2.originalvalue AS NextState,
t2.changetime AS nextStateTime,
ROW_NUMBER() OVER (
PARTITION BY t1.entityid, t1.originalvalue, t2.originalvalue
ORDER BY t2.changetime ASC
) AS row_number
FROM table1 t1
INNER JOIN table1 t2
ON t1.entityid = t2.entityid
AND t1.newvalue = t2.originalvalue
AND t1.changetime <= t2.changetime
)
SELECT entityid, userid, startStateTime, StartingState, NextState, nextStateTime
FROM NumberedStates
WHERE row_number = 1
Although this works, the table is very large (millions of rows), and the query is slow. Is there a more efficient way to achieve the desired result or optimize this approach?
1 Answer 1
Your originalvalue
column is
denormed
and is perhaps a distraction.
What really matters is newvalue
, along with
its associated timestamp (which is already normalized in your schema).
Do a quick GROUP BY changetime
scan of the table
to view any rows HAVING COUNT(*) > 1
.
If they're all unique, great.
Otherwise we can add a few microseconds of random fuzz,
or tack on a random GUID column to uniquify those timestamps.
Produce a new "report" relation with
PK
of (entityid, changetime)
, preserving all columns of interest
while discarding originalvalue
.
Generate a new seq
column with sequential integers from 1
up to the few million rows you have.
Create a compound index on (entityid, seq)
.
Now you're set up to do another (efficient!) self join.
In the ON
clause it's trivially easy to predict the predecessor row:
we simply subtract 1
.
The query completes quickly,
while the unique sequence number avoids duplicates.
Use EXPLAIN SELECT ...
to verify the query plan is
exploiting that new index.
(Notice that we're using numbers that are sequential
within an entityid
, not sequential across global timestamps.)
You can readily accomplish the same effect without allocating any
new DB storage, if you're willing to write a tiny amount of procedural code.
Simply ORDER BY entityid, changetime
,
hang onto a prev
row, and output the delta times as required.
Assuming that a given entity has a "small" number of state changes,
and that entityid
is indexed, we can use this to quickly report
on entities of interest.
With an index on the timestamp column we'd also be able
to quickly identify the small collection of entities that
recently changed state.