Question: Suppose you are a given a csv with N lines of the form StudentID, score. Multiple lines can have same student-ID. Find the student with maximum average score
I can't think of any way to make it faster than O(N*logN).
Maintain a hash-table that keeps track of STudentID = Key and Value being an object with sum of scores and number of scores seen with that studentID --> O(N)
Create an array with values in hash table, maybe convert each entry into average while doing so
Sort it worst-case N(logN), all entries belong to same student
Is there any way to do it faster?
-
$\begingroup$ Note that hashtables don't usually have $O(1)$ worst-case performance. $\endgroup$Raphael– Raphael2016年09月28日 06:14:09 +00:00Commented Sep 28, 2016 at 6:14
-
4$\begingroup$ Finding maximums is trivially in $O(n)$. I don't get what your problem is. $\endgroup$Raphael– Raphael2016年09月28日 06:14:53 +00:00Commented Sep 28, 2016 at 6:14
2 Answers 2
You are on a good path, but seem to be confused for no reason. After step 2 you have created an array with average scores. This is a worst case O(N) size array. What is the complexity of finding the highest average score now? Do you need to sort the array?
-
$\begingroup$ What's the complexity if you insist on sorting the array, for no good reason, and what is the complexity if you focus on the actual problem, which is finding the highest score? $\endgroup$gnasher729– gnasher7292016年09月28日 10:05:58 +00:00Commented Sep 28, 2016 at 10:05
@megas has the right approach. All I want to add is another, yet more involved solution:
- you can calculate averages online, instead of computing it at the end. You need to maintain the count and average. For every score update and before updating the count,
new_avg = (old_avg*count + score)/(count + 1)
. Then update the count. - now that you have all student averages at any point in time, you can take advantage of order-maintenance solutions to keep your set of student scores ordered at all times.
Note that by doing this, you're paying the time tax on keeping all student scores sorted. That's why @megas has the better solution because you're interested in the max score.