3
\$\begingroup\$

I have a table that tracks state changes of an entity with timestamps. The schema of the table is as follows:

entityid userid originalvalue newvalue changetime
255 101 Draft Submitted 2023年11月28日 12:10:56.389111
255 102 Submitted Received 2023年12月04日 10:30:02.776323
255 103 Received Audited 2024年01月07日 09:50:21.700231
255 104 Audited Paid 2024年01月18日 13:19:06.805542
255 105 Paid In Progress 2024年02月04日 12:10:58.670298
255 105 In Progress Require Company Response 2024年02月04日 12:44:03.201730
255 105 Require Company Response In Progress 2024年05月19日 11:37:08.419742
255 106 In Progress Ready for primary registration 2024年05月19日 11:37:53.641486
255 107 Ready for primary registration In-session 2024年05月20日 10:37:48.640948
255 107 In-session Ready for primary registration 2024年05月20日 11:45:44.560424
255 107 Ready for primary registration In-session 2024年05月21日 10:06:47.653381
255 107 In-session Primary Registered 2024年06月06日 15:29:22.813432
255 108 Primary Registered Require Company Response 2024年06月09日 14:21:12.664475
255 102 Require Company Response In Progress 2024年06月24日 08:28:55.994459

I want to compare the time it takes for the originalvalue to transition to the newvalue. For example, if the originalvalue is "Draft" and it becomes "Submitted", I need to find the corresponding newvalue in the next row and calculate the time between the originalvalue and newvalue changes.

To achieve this, I joined the table with itself on entityid where t1.newvalue = t2.originalvalue and t1.changetime <= t2.changetime:

 SELECT t1.entityid,
 t1.userid,
 t1.changetime as startStateTime
 t1.originalvalue as StartingState,
 t2.originalvalue as NextState,
 t2.changetime as nextStateTime
 FROM table1 t1
 INNER JOIN table1 t2
 ON t1.entityid = t2.entityid
 AND t1.newvalue = t2.originalvalue
 and t1.changetime <= t2.changetime

However, this query sometimes results in duplicate rows where the same state transition appears multiple times with different timestamps. For example, the state "In Progress" appears multiple times, causing duplicates.

entityid userid StartStateTime StartingState NextState NextStateTime
255 101 2023年11月28日 12:10:56.389111 "Draft" "Submitted" 2023年12月04日 10:30:02.776323
255 102 2023年12月04日 10:30:02.776323 "Submitted" "Received" 2024年01月07日 09:50:21.700231
255 103 2024年01月07日 09:50:21.700231 "Received" "Audited" 2024年01月18日 13:19:06.805542
255 104 2024年01月18日 13:19:06.805542 "Audited" "Paid" 2024年02月04日 12:10:58.670298
255 105 2024年02月04日 12:10:58.670298 "Paid" "In Progress" 2024年02月04日 12:44:03.201730
255 105 2024年02月04日 12:44:03.201730 "In Progress" "Require Company Response" 2024年05月19日 11:37:08.419742
255 105 2024年02月04日 12:10:58.670298 "Paid" "In Progress" 2024年05月19日 11:37:53.641486
255 105 2024年05月19日 11:37:08.419742 "Require Company Response" "In Progress" 2024年05月19日 11:37:53.641486
255 106 2024年05月19日 11:37:53.641486 "In Progress" "Ready for primary registration" 2024年05月20日 10:37:48.640948
255 107 2024年05月20日 10:37:48.640948 "Ready for primary registration" "In-session" 2024年05月20日 11:45:44.560424
255 106 2024年05月19日 11:37:53.641486 "In Progress" "Ready for primary registration" 2024年05月21日 10:06:47.653381
255 107 2024年05月20日 11:45:44.560424 "In-session" "Ready for primary registration" 2024年05月21日 10:06:47.653381
255 107 2024年05月20日 10:37:48.640948 "Ready for primary registration" "In-session" 2024年06月06日 15:29:22.813432
255 107 2024年05月21日 10:06:47.653381 "Ready for primary registration" "In-session" 2024年06月06日 15:29:22.813432
255 107 2024年06月06日 15:29:22.813432 "In-session" "Primary Registered" 2024年06月09日 14:21:12.664475
255 105 2024年02月04日 12:44:03.201730 "In Progress" "Require Company Response" 2024年06月24日 08:28:55.994459
255 108 2024年06月09日 14:21:12.664475 "Primary Registered" "Require Company Response" 2024年06月24日 08:28:55.994459

To filter these duplicates, I used the ROW_NUMBER() window function:

what i did it user row number

 WITH NumberedStates AS (
 SELECT t1.entityid,
 t1.userid,
 t1.changetime AS startStateTime,
 t1.originalvalue AS StartingState,
 t2.originalvalue AS NextState,
 t2.changetime AS nextStateTime,
 ROW_NUMBER() OVER (
 PARTITION BY t1.entityid, t1.originalvalue, t2.originalvalue
 ORDER BY t2.changetime ASC
 ) AS row_number
 FROM table1 t1
 INNER JOIN table1 t2
 ON t1.entityid = t2.entityid
 AND t1.newvalue = t2.originalvalue
 AND t1.changetime <= t2.changetime
)
SELECT entityid, userid, startStateTime, StartingState, NextState, nextStateTime
FROM NumberedStates 
WHERE row_number = 1

Although this works, the table is very large (millions of rows), and the query is slow. Is there a more efficient way to achieve the desired result or optimize this approach?

toolic
14.6k5 gold badges29 silver badges204 bronze badges
asked Aug 13, 2024 at 6:13
\$\endgroup\$

1 Answer 1

4
\$\begingroup\$

Your originalvalue column is denormed and is perhaps a distraction. What really matters is newvalue, along with its associated timestamp (which is already normalized in your schema).

Do a quick GROUP BY changetime scan of the table to view any rows HAVING COUNT(*) > 1. If they're all unique, great. Otherwise we can add a few microseconds of random fuzz, or tack on a random GUID column to uniquify those timestamps.

Produce a new "report" relation with PK of (entityid, changetime), preserving all columns of interest while discarding originalvalue. Generate a new seq column with sequential integers from 1 up to the few million rows you have. Create a compound index on (entityid, seq).

Now you're set up to do another (efficient!) self join. In the ON clause it's trivially easy to predict the predecessor row: we simply subtract 1. The query completes quickly, while the unique sequence number avoids duplicates. Use EXPLAIN SELECT ... to verify the query plan is exploiting that new index. (Notice that we're using numbers that are sequential within an entityid, not sequential across global timestamps.)


You can readily accomplish the same effect without allocating any new DB storage, if you're willing to write a tiny amount of procedural code. Simply ORDER BY entityid, changetime, hang onto a prev row, and output the delta times as required.

Assuming that a given entity has a "small" number of state changes, and that entityid is indexed, we can use this to quickly report on entities of interest. With an index on the timestamp column we'd also be able to quickly identify the small collection of entities that recently changed state.

answered Aug 13, 2024 at 14:59
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.