References
Using those references:
- rolling median implementations benchmark
- FAST RUNNING MEDIAN USING AN INDEXABLE SKIPLIST
- Calculate the median of a billion numbers
- Find running median from a stream of integers
Code
A code was written to calculate the Median with the SortedContainers library:
from itertools import islice
from sortedcontainers import SortedList
import random
import time
start_time = time.time()
class Median(object):
def __init__(self, iterable):
self._iterable = islice(iterable, None)
self._sortedlist = SortedList(self._iterable)
def __iter__(self):
self_sortedlist = self._sortedlist
# print(self_sortedlist)
length = len(self_sortedlist)
half = length // 2
if length % 2 == 0:
yield (self_sortedlist[half] + self_sortedlist[half - 1]) // 2
elif length % 2 == 1:
yield self_sortedlist[half]
def main():
m, n = 1000, 1500000
data = [random.randrange(m) for i in range(n)]
# print("Random Data: ", data)
result = list(Median(data))
print("Result: ", result)
if __name__ == "__main__":
main()
print("--- %s seconds ---" % (time.time() - start_time))
Explanation
Random Number Generator
The following code generates data within the Range m
and quantity n
.
m, n = 1000, 15000000
data = [random.randrange(m) for i in range(n)]
Median
The Median class sorts the list of numbers and if n
is odd, returns the middle item with yield self_sortedlist[half]
. Or if n
is even, returns the mean of two middle itens of the list with yield (self_sortedlist[half] + self_sortedlist[half - 1]) // 2
Question
How do I improve the code performance? Because for a large list (100 milion), it takes --- 186.7168517112732 seconds---
on my computer.
4 Answers 4
Performance
I instrumented your code differently:
def main():
m, n = 1000, 15000000
start_time = time.time()
data = [random.randrange(m) for i in range(n)]
print("--- %s seconds to generate" % (time.time() - start_time))
# print("Random Data: ", data)
start_time = time.time()
result = SortedList(data)
print("--- %s seconds to sort" % (time.time() - start_time))
start_time = time.time()
result = list(Median(data))
print("Result: ", result)
print("--- %s seconds ---" % (time.time() - start_time))
For me, the results were:
--- 7.407598257064819 seconds to generate
--- 4.535749673843384 seconds to sort
Result: [500]
--- 5.01109504699707 seconds ---
This shows that the generation of the random input takes longer than finding the median, and that 90% of Median()
is spent sorting (probably most of the rest is caused by converting between lists and iterators). You're unlikely to get large gains through modifications to your own part of the code.
I got better results by using Python's built-in sorted()
. We don't need any of the extra functionality of SortedList
(maintaining the invariant through adds and deletes), and a single sort (mostly in native code) gives these results:
def median(values):
sortedlist = sorted(values)
length = len(sortedlist)
half = length // 2
# return not yield; see "General review"
if length % 2 == 0:
return (sortedlist[half] + sortedlist[half - 1]) // 2
else:
return sortedlist[half]
def main():
m, n = 1000, 15000000
start_time = time.time()
data = [random.randrange(m) for i in range(n)]
print("--- %s seconds to generate" % (time.time() - start_time))
# print("Random Data: ", data)
start_time = time.time()
result = sorted(data)
print("--- %s seconds to sort" % (time.time() - start_time))
start_time = time.time()
result = median(data)
print("Result: ", result)
print("--- %s seconds ---" % (time.time() - start_time))
--- 7.638948202133179 seconds to generate
--- 3.118924617767334 seconds to sort
Result: 500
--- 3.3397886753082275 seconds ---
General review
I don't see why you yield
a single value, rather than simply returning. Is there a need for Median()
to be iterable?
Given that length
is an integer, if length % 2
is not 0, it must be 1 - so the elif
could easily be else
.
-
\$\begingroup\$ You are right on the General Review, yield is only necessary for the running median, so i changed to
return
and the result toresult = Median(data)
. Else was used instead of elif. Is there any algorithm that generates the list on an improved way, faster? \$\endgroup\$danieltakeshi– danieltakeshi2018年08月15日 13:52:06 +00:00Commented Aug 15, 2018 at 13:52 -
\$\begingroup\$ Yes, use Python's built-in sort for a one-time sorting of a list - see edit. \$\endgroup\$Toby Speight– Toby Speight2018年08月15日 14:20:10 +00:00Commented Aug 15, 2018 at 14:20
If you want to time something, you should time just that. When I ran your original code, I got 6.023478031158447 seconds. When I instead did
start_time = time.time()
result = list(Median(data))
end_time = time.time()
print("Result: ", result, end_time-start_time)
I got 1.8368241786956787. Note that calling time.time()
before the print
statement means that the execution of the print
statement isn't included in your calculation of elapsed time.
I also tried
import statistics
start_time = time.time()
result = statistics.median(data)
end_time = time.time()
print("Result: ", result, end_time-start_time)
and got 0.7475762367248535.
What do mean by "not using heap", and why do you want that? What makes you think SortedList doesn't use a heap? If you're going to use pre-made functions, why not just use the pre-made median
function?
-
\$\begingroup\$ Just taking some challenges to learn Python. It was without heap, because there are many answers and duplicates. Thanks for the tip for measuring the time. \$\endgroup\$danieltakeshi– danieltakeshi2018年08月15日 19:56:45 +00:00Commented Aug 15, 2018 at 19:56
Thanks to Toby now the program takes:
--- 7.169409990310669 seconds to finish the program---
instead of --- 186.7168517112732 seconds---
for 100 milion numbers.
One other improvement I made was to add an extra optimization on the random generating algorithm, using numpy. So numpy.sort()
was used to sort the numpy array at the start of median()
, because it is faster for numpy arrays.
Code
import time
import numpy
start_time0 = time.time()
def median(values):
sortedlist = numpy.sort(values)
length = len(sortedlist)
half = length // 2
if length % 2 == 0:
return (sortedlist[half] + sortedlist[half - 1]) // 2
else:
return sortedlist[half]
def main():
m, n = 1000, 100000000
# Random Number Generator
start_time = time.time()
data = numpy.random.randint(1, m, n)
print("--- %s seconds to numpy random---" % (time.time() - start_time))
# Median
start_time = time.time()
result = median(data)
print("Result: ", result)
print("--- %s seconds to find the Median---" % (time.time() - start_time))
if __name__ == "__main__":
main()
print("--- %s seconds to finish the program---" % (time.time() - start_time0))
-
\$\begingroup\$ In this case, you might want to consider posting another question if you want another review? As it is, your newly refactored code isn't more helpful than the already accepted answer :) \$\endgroup\$IEatBagels– IEatBagels2018年08月15日 19:44:51 +00:00Commented Aug 15, 2018 at 19:44
-
3
just a note : you can enhance just a tidbit by defining the name in the list comprehension
from
data = [random.randrange(m) for i in range(n)]
to
data = [random.randrange(m) for num in range(n)]
O(n log n)
. You may want to look into Medians of Medians (scroll down for a python implementation) which has linear time complexity. \$\endgroup\$