Median Calculation of List of Integers without using heap

Question 1

References

Using those references:

Code

A code was written to calculate the Median with the SortedContainers library:

from itertools import islice
from sortedcontainers import SortedList
import random
import time
start_time = time.time()
class Median(object):
 def __init__(self, iterable):
 self._iterable = islice(iterable, None)
 self._sortedlist = SortedList(self._iterable)
 def __iter__(self):
 self_sortedlist = self._sortedlist
 # print(self_sortedlist)
 length = len(self_sortedlist)
 half = length // 2
 if length % 2 == 0:
 yield (self_sortedlist[half] + self_sortedlist[half - 1]) // 2
 elif length % 2 == 1:
 yield self_sortedlist[half]
def main():
 m, n = 1000, 1500000
 data = [random.randrange(m) for i in range(n)]
 # print("Random Data: ", data)
 result = list(Median(data))
 print("Result: ", result)
if __name__ == "__main__":
 main()
 print("--- %s seconds ---" % (time.time() - start_time))

Explanation

Random Number Generator

The following code generates data within the Range m and quantity n.

m, n = 1000, 15000000
data = [random.randrange(m) for i in range(n)]

Median

The Median class sorts the list of numbers and if n is odd, returns the middle item with yield self_sortedlist[half]. Or if n is even, returns the mean of two middle itens of the list with yield (self_sortedlist[half] + self_sortedlist[half - 1]) // 2

Question

How do I improve the code performance? Because for a large list (100 milion), it takes --- 186.7168517112732 seconds--- on my computer.

Question 2

sorting has time complexity O(n log n). You may want to look into Medians of Medians (scroll down for a python implementation) which has linear time complexity.

Question 3

Why do you use integer division for a list with even length? Shouldn't it be a regular division of the two center values in the sorted list?

Question 4

I think there is a misunderstanding; the use of heaps and such clever data structures is needed to maintain a rolling median, i.e. you have to return the current median after every number in the input. If you want to call it only once on the final list, then a simple quickselect is enough - that gets you to O(n) time. (If you were to use this algorithm for the rolling mean case, you would need O(1+2+..+n) = O(n^2) time, which is why heaps and other structures are used.)

Question 5

Timsort is a sorting algorithm, and it takes O(n log n) time. Quickselect is not a sorting algorithm, and it takes O(n) time. Quickselect is much faster than timsort for this use case; the key observation is that to get the median you don't need to sort the whole vector, just as if you want to compute the minimum you don't need to sort the whole vector, you can just do one pass over the elements. With the median (and in general with order statistics) it's more complicated, so you use quickselect, but it still has a O(n) performance, much better than your sorting approach which is O(n log n)

Question 6

@DanielLenz Oh, It was to reuse the number as Integer. Just as the data type of the input. However, if the precision of the program is needed for decimal numbers, then it can be done.

Question 7

Performance

I instrumented your code differently:

def main():
 m, n = 1000, 15000000
 start_time = time.time()
 data = [random.randrange(m) for i in range(n)]
 print("--- %s seconds to generate" % (time.time() - start_time))
 # print("Random Data: ", data)
 start_time = time.time()
 result = SortedList(data)
 print("--- %s seconds to sort" % (time.time() - start_time))
 start_time = time.time()
 result = list(Median(data))
 print("Result: ", result)
 print("--- %s seconds ---" % (time.time() - start_time))

For me, the results were:

--- 7.407598257064819 seconds to generate
--- 4.535749673843384 seconds to sort
Result: [500]
--- 5.01109504699707 seconds ---

This shows that the generation of the random input takes longer than finding the median, and that 90% of Median() is spent sorting (probably most of the rest is caused by converting between lists and iterators). You're unlikely to get large gains through modifications to your own part of the code.

I got better results by using Python's built-in sorted(). We don't need any of the extra functionality of SortedList (maintaining the invariant through adds and deletes), and a single sort (mostly in native code) gives these results:

def median(values):
 sortedlist = sorted(values)
 length = len(sortedlist)
 half = length // 2
 # return not yield; see "General review"
 if length % 2 == 0:
 return (sortedlist[half] + sortedlist[half - 1]) // 2
 else:
 return sortedlist[half]
def main():
 m, n = 1000, 15000000
 start_time = time.time()
 data = [random.randrange(m) for i in range(n)]
 print("--- %s seconds to generate" % (time.time() - start_time))
 # print("Random Data: ", data)
 start_time = time.time()
 result = sorted(data)
 print("--- %s seconds to sort" % (time.time() - start_time))
 start_time = time.time()
 result = median(data)
 print("Result: ", result)
 print("--- %s seconds ---" % (time.time() - start_time))

--- 7.638948202133179 seconds to generate
--- 3.118924617767334 seconds to sort
Result: 500
--- 3.3397886753082275 seconds ---

General review

I don't see why you yield a single value, rather than simply returning. Is there a need for Median() to be iterable?

Given that length is an integer, if length % 2 is not 0, it must be 1 - so the elif could easily be else.

Question 8

You are right on the General Review, yield is only necessary for the running median, so i changed to return and the result to result = Median(data). Else was used instead of elif. Is there any algorithm that generates the list on an improved way, faster?

Question 9

Yes, use Python's built-in sort for a one-time sorting of a list - see edit.

Question 10

If you want to time something, you should time just that. When I ran your original code, I got 6.023478031158447 seconds. When I instead did

start_time = time.time()
result = list(Median(data))
end_time = time.time()
print("Result: ", result, end_time-start_time)

I got 1.8368241786956787. Note that calling time.time() before the print statement means that the execution of the print statement isn't included in your calculation of elapsed time.

I also tried

import statistics
start_time = time.time()
result = statistics.median(data)
end_time = time.time()
print("Result: ", result, end_time-start_time)

and got 0.7475762367248535.

What do mean by "not using heap", and why do you want that? What makes you think SortedList doesn't use a heap? If you're going to use pre-made functions, why not just use the pre-made median function?

Question 11

Just taking some challenges to learn Python. It was without heap, because there are many answers and duplicates. Thanks for the tip for measuring the time.

Question 12

Thanks to Toby now the program takes:

--- 7.169409990310669 seconds to finish the program--- instead of --- 186.7168517112732 seconds--- for 100 milion numbers.

One other improvement I made was to add an extra optimization on the random generating algorithm, using numpy. So numpy.sort() was used to sort the numpy array at the start of median(), because it is faster for numpy arrays.

Code

import time
import numpy
start_time0 = time.time()
def median(values):
 sortedlist = numpy.sort(values)
 length = len(sortedlist)
 half = length // 2
 if length % 2 == 0:
 return (sortedlist[half] + sortedlist[half - 1]) // 2
 else:
 return sortedlist[half]
def main():
 m, n = 1000, 100000000
 # Random Number Generator
 start_time = time.time()
 data = numpy.random.randint(1, m, n)
 print("--- %s seconds to numpy random---" % (time.time() - start_time))
 # Median
 start_time = time.time()
 result = median(data)
 print("Result: ", result)
 print("--- %s seconds to find the Median---" % (time.time() - start_time))
if __name__ == "__main__":
 main()
 print("--- %s seconds to finish the program---" % (time.time() - start_time0))

Question 13

In this case, you might want to consider posting another question if you want another review? As it is, your newly refactored code isn't more helpful than the already accepted answer :)

Question 14

Oh forgive us - it was difficult to see the implicit change. Hopefully my edit makes it easier for others to see

Question 15

just a note : you can enhance just a tidbit by defining the name in the list comprehension

from

data = [random.randrange(m) for i in range(n)]

to

data = [random.randrange(m) for num in range(n)]

Toby Speight Toby Speight 87.1k14 gold badges104 silver badges322 bronze badges · Accepted Answer · 2018-08-15 13:39:13Z

Performance

I instrumented your code differently:

def main():
 m, n = 1000, 15000000
 start_time = time.time()
 data = [random.randrange(m) for i in range(n)]
 print("--- %s seconds to generate" % (time.time() - start_time))
 # print("Random Data: ", data)
 start_time = time.time()
 result = SortedList(data)
 print("--- %s seconds to sort" % (time.time() - start_time))
 start_time = time.time()
 result = list(Median(data))
 print("Result: ", result)
 print("--- %s seconds ---" % (time.time() - start_time))

For me, the results were:

--- 7.407598257064819 seconds to generate
--- 4.535749673843384 seconds to sort
Result: [500]
--- 5.01109504699707 seconds ---

This shows that the generation of the random input takes longer than finding the median, and that 90% of Median() is spent sorting (probably most of the rest is caused by converting between lists and iterators). You're unlikely to get large gains through modifications to your own part of the code.

I got better results by using Python's built-in sorted(). We don't need any of the extra functionality of SortedList (maintaining the invariant through adds and deletes), and a single sort (mostly in native code) gives these results:

def median(values):
 sortedlist = sorted(values)
 length = len(sortedlist)
 half = length // 2
 # return not yield; see "General review"
 if length % 2 == 0:
 return (sortedlist[half] + sortedlist[half - 1]) // 2
 else:
 return sortedlist[half]
def main():
 m, n = 1000, 15000000
 start_time = time.time()
 data = [random.randrange(m) for i in range(n)]
 print("--- %s seconds to generate" % (time.time() - start_time))
 # print("Random Data: ", data)
 start_time = time.time()
 result = sorted(data)
 print("--- %s seconds to sort" % (time.time() - start_time))
 start_time = time.time()
 result = median(data)
 print("Result: ", result)
 print("--- %s seconds ---" % (time.time() - start_time))

--- 7.638948202133179 seconds to generate
--- 3.118924617767334 seconds to sort
Result: 500
--- 3.3397886753082275 seconds ---

General review

I don't see why you yield a single value, rather than simply returning. Is there a need for Median() to be iterable?

Given that length is an integer, if length % 2 is not 0, it must be 1 - so the elif could easily be else.

You are right on the General Review, yield is only necessary for the running median, so i changed to return and the result to result = Median(data). Else was used instead of elif. Is there any algorithm that generates the list on an improved way, faster?
Yes, use Python's built-in sort for a one-time sorting of a list - see edit.

Stack Exchange Network

Median Calculation of List of Integers without using heap

References

Code

Explanation

Random Number Generator

Median

Question

4 Answers 4

Performance

General review

Code

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Median Calculation of List of Integers without using heap

References

Code

Explanation

Random Number Generator

Median

Question

4 Answers 4

Performance

General review

Code

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions