Part of our ERP system is a sub-system for running background jobs. We track a variety of metadata about our jobs in a table, including timestamps for when each job was submitted, started, and ended.
I'm creating a report showing the performance of our job system, broken down by day. One KPI is the maximum number of jobs running at once. The algorithm I'm currently using is:
dim cnt as integer 'number of overlapping jobs
dim max as integer 'maximum number of jobs running at one time

for each Job where
    Job.SubmitTs = today

    cnt = 0 'reset the count for this job

    'bJob is a 2nd instance of Job
    for each bJob where
        bJob.SubmitTs = today and
        bJob.StartTs <= Job.EndTs and
        bJob.EndTs >= Job.StartTs
        cnt = cnt + 1
    next

    if cnt > max then max = cnt
next
The problem with this algorithm is that it is very slow due to all the nested looping. Is there a faster way to implement this?
Edit: I cannot use SQL queries to access the data.
- If you post a schema of that table, we might be able to suggest a query that can get what you need. – FrustratedWithFormsDesigner, May 18, 2012 at 15:50
- I can't use a SQL query to get the data, as the language doesn't support it. – briddums, May 18, 2012 at 16:19
- This was asked and solved multiple times, in SQL and outside of it: stackoverflow.com/questions/117962/… stackoverflow.com/questions/2037618/… stackoverflow.com/questions/1025688/… stackoverflow.com/questions/325933/… – Job, May 18, 2012 at 16:20
- @Job Three of your links point to SQL solutions, which I cannot use. The fourth deals with comparing two date ranges; I am looking for a more efficient solution for comparing 1.2 million ranges. – briddums, May 18, 2012 at 16:32
- @maple_shaft For codereview.SE there should be real code involved; this is just pseudocode as far as I can tell, and wrong at that (although it resembles VB). To the OP: if you can provide at least a language name, I can give you a better code snippet... – K.Steff, May 18, 2012 at 16:45
2 Answers
- Include all the start and end points (in time) of the jobs in an array; this creates 2*N elements (one for each start, one for each end).
- Sort the array by the timestamp of each event, then iterate over the 2*N elements as follows:

for each element X:
    if X.type == start:
        counter = counter + 1
    else:
        counter = counter - 1
    ans = max(ans, counter)

Complexity: O(n log n) for the initial build of the sorted structure, plus O(n) for the iteration through its elements.
Edit: Giorgio suggested using an array, which is a better option. (The suggestion originally was to use a red/black tree, but no task-removal capability seems to be needed, so maintaining order is trivial.)
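A minimal sketch of this sweep in Python (the function name and the (start, end) pair input format are illustrative assumptions, not part of the original post):

def max_concurrent_jobs(jobs):
    """jobs: iterable of (start_ts, end_ts) pairs for the day in question."""
    events = []
    for start, end in jobs:
        events.append((start, 1))   # a job starting adds one running job
        events.append((end, -1))    # a job ending removes one
    # Sort by time; on ties the -1 sorts before the +1, so a job ending at t
    # and another starting at t are not counted as running together.
    events.sort()
    best = running = 0
    for _, delta in events:
        running += delta
        best = max(best, running)
    return best

For example, max_concurrent_jobs([(10, 15), (15, 23), (12, 20)]) returns 2: the first and third jobs overlap, while the second only starts when the first ends.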
- +1: Cool idea! Wouldn't it be possible to just sort all the start and end points in an array (this also takes O(n log n)) and then scan the array as you have described? – Giorgio, May 18, 2012 at 16:56
- @Giorgio Actually yes, the OP never says tasks are to be removed, so a dynamic array is a better fit. Seems I unconsciously introduced further complexity :) – K.Steff, May 18, 2012 at 16:58
- Brilliant! Collapsing it into a single array made it super-fast. – briddums, May 18, 2012 at 17:34
- Having the array contents already sorted seems like a prerequisite for the algorithm to work at all. – matt b, Jul 9, 2013 at 14:31
SQL knows how to sort stuff, and I'd bet your engine uses the most efficient algorithm it can (i.e. O(n log n)), while still working if the result set doesn't fit in RAM (in contrast with K.Steff's answer, which will fail if the array doesn't fit in RAM). In Python:
import sqlite3

connection = sqlite3.connect("database.db")
cursor = connection.cursor()
# Turn every job into two events: +1 when it starts, -1 when it ends,
# and let the database sort them by time.
cursor.execute("SELECT isStart FROM (" +
               "  SELECT startTime AS time, 1 AS isStart FROM myTable " +
               "  UNION ALL " +
               "  SELECT endTime AS time, -1 AS isStart FROM myTable " +
               "  ORDER BY time ASC, isStart ASC" +
               ")")
maxOverlap = 0
currentOverlap = 0
for (isStart,) in cursor:
    currentOverlap += isStart                    # running count of active jobs
    maxOverlap = max(maxOverlap, currentOverlap)
print(maxOverlap)
Note the isStart ASC order is necessary so the intervals [10,15] and [15,23] are not considered overlapping.
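As a quick sanity check (hypothetical values, not from the original answer), counting over just those two intervals with the end event sorted before the start event on ties gives a peak of 1, i.e. they never run at the same time:

events = [(10, 1), (15, -1), (15, 1), (23, -1)]   # [10,15] and [15,23], end before start at t=15
peak = running = 0
for _, delta in events:
    running += delta
    peak = max(peak, running)
print(peak)   # prints 1: the two intervals are not counted as overlapping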
For further improvement, the columns startTime and endTime probably have an index, in which case the SQL engine should efficiently merge the two already-sorted lists, resulting in an O(n) operation (the O(log n) factor then happens at insertion time for each row, when it is added to the index, but you pay that anyway). If your SQL engine isn't smart enough to do an efficient merge, do it yourself on the fly (still in Python):
import sqlite3

connection = sqlite3.connect("database.db")
# Two cursors: one over the sorted start times, one over the sorted end times.
cursorStart = connection.cursor()
cursorStart.execute("SELECT startTime FROM myTable ORDER BY startTime ASC")
cursorEnd = connection.cursor()
cursorEnd.execute("SELECT endTime FROM myTable ORDER BY endTime ASC")

maxOverlap = 0
currentOverlap = 0
currentStart = cursorStart.fetchone()
currentEnd = cursorEnd.fetchone()
# Merge the two sorted streams: a start increments the running count,
# an end decrements it. Ties go to the end, so touching intervals don't overlap.
while currentStart is not None and currentEnd is not None:
    if currentStart < currentEnd:
        currentOverlap += 1
        currentStart = cursorStart.fetchone()
    else:
        currentOverlap -= 1
        currentEnd = cursorEnd.fetchone()
    maxOverlap = max(maxOverlap, currentOverlap)
print(maxOverlap)