Faster Query to Track State changes in a Large Table

Question 1

I have a table that tracks state changes of an entity with timestamps. The schema of the table is as follows:

entityid	userid	originalvalue	newvalue	changetime
255	101	Draft	Submitted	2023年11月28日 12:10:56.389111
255	102	Submitted	Received	2023年12月04日 10:30:02.776323
255	103	Received	Audited	2024年01月07日 09:50:21.700231
255	104	Audited	Paid	2024年01月18日 13:19:06.805542
255	105	Paid	In Progress	2024年02月04日 12:10:58.670298
255	105	In Progress	Require Company Response	2024年02月04日 12:44:03.201730
255	105	Require Company Response	In Progress	2024年05月19日 11:37:08.419742
255	106	In Progress	Ready for primary registration	2024年05月19日 11:37:53.641486
255	107	Ready for primary registration	In-session	2024年05月20日 10:37:48.640948
255	107	In-session	Ready for primary registration	2024年05月20日 11:45:44.560424
255	107	Ready for primary registration	In-session	2024年05月21日 10:06:47.653381
255	107	In-session	Primary Registered	2024年06月06日 15:29:22.813432
255	108	Primary Registered	Require Company Response	2024年06月09日 14:21:12.664475
255	102	Require Company Response	In Progress	2024年06月24日 08:28:55.994459

I want to compare the time it takes for the originalvalue to transition to the newvalue. For example, if the originalvalue is "Draft" and it becomes "Submitted", I need to find the corresponding newvalue in the next row and calculate the time between the originalvalue and newvalue changes.

To achieve this, I joined the table with itself on entityid where t1.newvalue = t2.originalvalue and t1.changetime <= t2.changetime:

 SELECT t1.entityid,
 t1.userid,
 t1.changetime as startStateTime
 t1.originalvalue as StartingState,
 t2.originalvalue as NextState,
 t2.changetime as nextStateTime
 FROM table1 t1
 INNER JOIN table1 t2
 ON t1.entityid = t2.entityid
 AND t1.newvalue = t2.originalvalue
 and t1.changetime <= t2.changetime

However, this query sometimes results in duplicate rows where the same state transition appears multiple times with different timestamps. For example, the state "In Progress" appears multiple times, causing duplicates.

entityid	userid	StartStateTime	StartingState	NextState	NextStateTime
255	101	2023年11月28日 12:10:56.389111	"Draft"	"Submitted"	2023年12月04日 10:30:02.776323
255	102	2023年12月04日 10:30:02.776323	"Submitted"	"Received"	2024年01月07日 09:50:21.700231
255	103	2024年01月07日 09:50:21.700231	"Received"	"Audited"	2024年01月18日 13:19:06.805542
255	104	2024年01月18日 13:19:06.805542	"Audited"	"Paid"	2024年02月04日 12:10:58.670298
255	105	2024年02月04日 12:10:58.670298	"Paid"	"In Progress"	2024年02月04日 12:44:03.201730
255	105	2024年02月04日 12:44:03.201730	"In Progress"	"Require Company Response"	2024年05月19日 11:37:08.419742
255	105	2024年02月04日 12:10:58.670298	"Paid"	"In Progress"	2024年05月19日 11:37:53.641486
255	105	2024年05月19日 11:37:08.419742	"Require Company Response"	"In Progress"	2024年05月19日 11:37:53.641486
255	106	2024年05月19日 11:37:53.641486	"In Progress"	"Ready for primary registration"	2024年05月20日 10:37:48.640948
255	107	2024年05月20日 10:37:48.640948	"Ready for primary registration"	"In-session"	2024年05月20日 11:45:44.560424
255	106	2024年05月19日 11:37:53.641486	"In Progress"	"Ready for primary registration"	2024年05月21日 10:06:47.653381
255	107	2024年05月20日 11:45:44.560424	"In-session"	"Ready for primary registration"	2024年05月21日 10:06:47.653381
255	107	2024年05月20日 10:37:48.640948	"Ready for primary registration"	"In-session"	2024年06月06日 15:29:22.813432
255	107	2024年05月21日 10:06:47.653381	"Ready for primary registration"	"In-session"	2024年06月06日 15:29:22.813432
255	107	2024年06月06日 15:29:22.813432	"In-session"	"Primary Registered"	2024年06月09日 14:21:12.664475
255	105	2024年02月04日 12:44:03.201730	"In Progress"	"Require Company Response"	2024年06月24日 08:28:55.994459
255	108	2024年06月09日 14:21:12.664475	"Primary Registered"	"Require Company Response"	2024年06月24日 08:28:55.994459

To filter these duplicates, I used the ROW_NUMBER() window function:

what i did it user row number

 WITH NumberedStates AS (
 SELECT t1.entityid,
 t1.userid,
 t1.changetime AS startStateTime,
 t1.originalvalue AS StartingState,
 t2.originalvalue AS NextState,
 t2.changetime AS nextStateTime,
 ROW_NUMBER() OVER (
 PARTITION BY t1.entityid, t1.originalvalue, t2.originalvalue
 ORDER BY t2.changetime ASC
 ) AS row_number
 FROM table1 t1
 INNER JOIN table1 t2
 ON t1.entityid = t2.entityid
 AND t1.newvalue = t2.originalvalue
 AND t1.changetime <= t2.changetime
)
SELECT entityid, userid, startStateTime, StartingState, NextState, nextStateTime
FROM NumberedStates 
WHERE row_number = 1

Although this works, the table is very large (millions of rows), and the query is slow. Is there a more efficient way to achieve the desired result or optimize this approach?

Question 2

Your originalvalue column is denormed and is perhaps a distraction. What really matters is newvalue, along with its associated timestamp (which is already normalized in your schema).

Do a quick GROUP BY changetime scan of the table to view any rows HAVING COUNT(*) > 1. If they're all unique, great. Otherwise we can add a few microseconds of random fuzz, or tack on a random GUID column to uniquify those timestamps.

Produce a new "report" relation with PK of (entityid, changetime), preserving all columns of interest while discarding originalvalue. Generate a new seq column with sequential integers from 1 up to the few million rows you have. Create a compound index on (entityid, seq).

Now you're set up to do another (efficient!) self join. In the ON clause it's trivially easy to predict the predecessor row: we simply subtract 1. The query completes quickly, while the unique sequence number avoids duplicates. Use EXPLAIN SELECT ... to verify the query plan is exploiting that new index. (Notice that we're using numbers that are sequential within an entityid, not sequential across global timestamps.)

You can readily accomplish the same effect without allocating any new DB storage, if you're willing to write a tiny amount of procedural code. Simply ORDER BY entityid, changetime, hang onto a prev row, and output the delta times as required.

Assuming that a given entity has a "small" number of state changes, and that entityid is indexed, we can use this to quickly report on entities of interest. With an index on the timestamp column we'd also be able to quickly identify the small collection of entities that recently changed state.

J_H J_H 41.4k3 gold badges38 silver badges157 bronze badges · Accepted Answer · 2024-08-13 14:59:04Z

Your originalvalue column is denormed and is perhaps a distraction. What really matters is newvalue, along with its associated timestamp (which is already normalized in your schema).

Do a quick GROUP BY changetime scan of the table to view any rows HAVING COUNT(*) > 1. If they're all unique, great. Otherwise we can add a few microseconds of random fuzz, or tack on a random GUID column to uniquify those timestamps.

Produce a new "report" relation with PK of (entityid, changetime), preserving all columns of interest while discarding originalvalue. Generate a new seq column with sequential integers from 1 up to the few million rows you have. Create a compound index on (entityid, seq).

Now you're set up to do another (efficient!) self join. In the ON clause it's trivially easy to predict the predecessor row: we simply subtract 1. The query completes quickly, while the unique sequence number avoids duplicates. Use EXPLAIN SELECT ... to verify the query plan is exploiting that new index. (Notice that we're using numbers that are sequential within an entityid, not sequential across global timestamps.)

You can readily accomplish the same effect without allocating any new DB storage, if you're willing to write a tiny amount of procedural code. Simply ORDER BY entityid, changetime, hang onto a prev row, and output the delta times as required.

Assuming that a given entity has a "small" number of state changes, and that entityid is indexed, we can use this to quickly report on entities of interest. With an index on the timestamp column we'd also be able to quickly identify the small collection of entities that recently changed state.

Stack Exchange Network

Faster Query to Track State changes in a Large Table

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Faster Query to Track State changes in a Large Table

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions