I was trying to implement my own version of matplotlib's hist function, and I came up with this:
import numpy as np
import matplotlib.pyplot as plt
import numba as nb
from numba import prange
import time

N = 10_000_000
normal = np.random.standard_normal(N)

@nb.jit(parallel=True)
def histogram(a, bins=1000, density=False):
    min_data = np.min(a)
    max_data = np.max(a)
    dx = (max_data - min_data) / bins
    D = np.zeros([bins, 2])
    for i in prange(bins):
        dr = i*dx + min_data
        x1 = a[a > dr]
        x2 = x1[x1 < (dr + dx)]
        D[i] = [dr, len(x2)]
    D = D.T
    x, y = D
    if density == True:
        inte = sum((x[1] - x[0])*y)
        y /= inte
    plt.bar(x, y, width=dx)
    return D

def main():
    start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
    histogram(normal, bins=1000, density=True)
    # plt.hist(normal, bins=1000, density=True)
    print("Execution time:", time.perf_counter() - start_time, "seconds")
    plt.tight_layout()
    plt.show()

if __name__ == '__main__':
    main()
This outputs the desired result, but with an execution time of 24.3555 seconds. However, when I use the original hist (just comment out my function and uncomment plt.hist), it does the job in just 0.6283 seconds, which is ridiculously fast. My question is: what am I doing wrong? And how does hist achieve this performance?

P.S.: Using this bit of code, you can see the result of the hist function next to my implementation:
fig, axs = plt.subplots(1,2)
axs[0].hist(normal, bins=100, density=True)
axs[1] = histogram(normal, bins=100, density=True)
plt.tight_layout()
plt.show()
where the left one is hist's output and the one on the right is my implementation; both are exactly the same.
- Can you show screenshots of your histogram? – Reinderien, Oct 18, 2021
- @Reinderien I just did. – Amirhossein Rezaei, Oct 18, 2021
1 Answer
You are slicing your original dataset 1000 times, making 2000 border comparisons for each of its values. It is much faster to calculate the proper bin for each value using simple math:
bin = (value - min_data) / dx
D[bin][1] += 1
And you can reuse the original loop to fill all the D[bin][0] bin borders. The only remaining problem is that for a value equal to max_data, bin comes out as bins (not bins - 1 as expected); you can either add an extra if, or simply allocate one extra bin (D = np.zeros([bins+1, 2])), fold the count from bin bins into bin bins - 1, and drop bin bins.
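A minimal sketch of that index math on toy data (the array and bin count here are made up purely for illustration; np.clip is one way to handle the max_data edge case just described):

```python
import numpy as np

# For each value, one division finds its bin directly, instead of
# slicing the whole array once per bin.
a = np.array([0.0, 0.4, 1.0, 2.0])   # toy data, for illustration only
bins = 4
min_data, max_data = a.min(), a.max()
dx = (max_data - min_data) / bins

idx = ((a - min_data) / dx).astype(int)
idx = np.clip(idx, 0, bins - 1)      # fold v == max_data into the last bin
print(idx)                            # -> [0 0 2 3]
```

Without the clip, the value 2.0 (== max_data) would land at index 4, one past the last valid bin.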
UPD: Here is my full code (based on the original code, with some refactoring):
def histogram(a, bins=1000, density=False):
    min_data = np.min(a)
    max_data = np.max(a)
    dx = (max_data - min_data) / bins
    x = np.zeros(bins)
    y = np.zeros(bins + 1)     # one extra slot for v == max_data
    for i in range(bins):
        x[i] = i*dx + min_data
    for v in a:
        bin = int((v - min_data) / dx)
        y[bin] += 1
    y[bins - 1] += y[bins]     # fold the overflow bin back into the last bin
    y = y[:bins]
    if density == True:
        inte = sum((x[1] - x[0])*y)
        y /= inte
    plt.bar(x, y, width=dx)
    return np.column_stack((x, y))
It gave me 15 seconds, compared with 60 seconds for the original code. What else can we do? The most intensive calculations are in this loop:
    for v in a:
        bin = int((v - min_data) / dx)
        y[bin] += 1
And they are done in pure Python, which is really slow (C is almost 10 times faster, and it is what numpy uses to optimise its inner calculations). So we'd better do these calculations with numpy (just change the previous 3 lines to these 3):
    a_to_bins = ((a - min_data) / dx).astype(int)
    for bin in a_to_bins:
        y[bin] += 1
That gave me a final time of about 7 seconds.
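Going one step further than the answer's code: the remaining per-value Python loop can itself be pushed into numpy with np.bincount (a sketch of an alternative I'm suggesting, not part of the code above):

```python
import numpy as np

# Same counting as the `for bin in a_to_bins` loop, but entirely in
# compiled numpy code via np.bincount.
rng = np.random.default_rng(0)
a = rng.standard_normal(1_000_000)
bins = 1000
min_data, max_data = a.min(), a.max()
dx = (max_data - min_data) / bins

idx = np.clip(((a - min_data) / dx).astype(int), 0, bins - 1)
y = np.bincount(idx, minlength=bins).astype(float)  # counts per bin

assert y.sum() == len(a)  # every value landed in exactly one bin
```

np.bincount does in one call what the Python loop does one element at a time, which removes the interpreter overhead per value.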
UPD 2: What is the slowest part now? As we know, intensive calculations in pure Python are slow. Let's check:
def histogram(a, bins=1000, density=False):
    start_time = time.perf_counter()
    min_data = np.min(a)
    max_data = np.max(a)
    dx = (max_data - min_data) / bins
    print(time.perf_counter() - start_time, 'to calc min/max')
    x = np.zeros(bins)
    y = np.zeros(bins + 1)
    print(time.perf_counter() - start_time, 'to create x, y')
    for i in range(bins):
        x[i] = i*dx + min_data
    print(time.perf_counter() - start_time, 'to calc bin borders')
    a_to_bins = ((a - min_data) / dx).astype(int)
    print(time.perf_counter() - start_time, 'to calc bins')
    for bin in a_to_bins:
        y[bin] += 1
    print(time.perf_counter() - start_time, 'to fill bins')
    y[bins - 1] += y[bins]
    y = y[:bins]
    if density == True:
        inte = sum((x[1] - x[0])*y)
        y /= inte
    print(time.perf_counter() - start_time, 'before draw')
    plt.bar(x, y, width=dx)
    print(time.perf_counter() - start_time, 'after draw')
    return np.column_stack((x, y))
This gives me:
0.010483399993972853 to calc min/max
0.011489700002130121 to create x, y
0.012588899990078062 to calc bin borders
0.09252060001017526 to calc bins
7.7265202999988105 to fill bins
7.727168200013693 before draw
8.440735899988795 after draw
So almost all the time is consumed by the Python loop adding 1 to each bin (note that the preceding numpy expression computed all the bin indices in less than 0.1 seconds). So if we rewrote that code in C we would probably get a roughly 10x faster result, which is pretty close to the original .hist timing.
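That is in fact where matplotlib's speed comes from: Axes.hist delegates the binning to np.histogram, which does the counting in compiled code. A minimal sketch of the same density histogram through that path:

```python
import numpy as np

# np.histogram is the compiled counting routine matplotlib's hist
# builds on; density=True normalizes the counts by total * bin width.
rng = np.random.default_rng(1)
a = rng.standard_normal(100_000)

counts, edges = np.histogram(a, bins=100, density=True)
widths = np.diff(edges)

# With density=True the histogram integrates to 1.
print(round((counts * widths).sum(), 6))  # -> 1.0
```

The returned counts and edges are exactly what plt.bar would need to reproduce plt.hist's drawing.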
By the way, if I use @nb.jit(parallel=True), the time goes down to:
0.014584699994884431 to calc min/max
0.014821599994320422 to create x, y
0.3439053999900352 to calc bin borders
0.5012229999992996 to calc bins
0.8304882999800611 to fill bins
0.8317092999932356 before draw
1.5190471999812871 after draw
while the original .hist on my PC takes 0.8466 seconds (I don't know whether it uses parallel computation or not).
In total: pure Python calculations are terribly slow, and moving them into C has a huge impact on speed. That is why you'd better use numpy operations (which run C under the hood) instead of Python loops. It can give a speedup of around 10x or even more.
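That gap can be seen directly by timing the two fill strategies side by side (a sketch; exact ratios vary by machine, and the data here is generated just for the comparison):

```python
import numpy as np
import timeit

# Compare the pure-Python fill loop against the vectorized equivalent
# on precomputed bin indices.
a = np.random.default_rng(2).standard_normal(100_000)
idx = ((a - a.min()) / ((a.max() - a.min()) / 100)).astype(int)
idx = np.clip(idx, 0, 99)

def loop_fill():
    y = np.zeros(100)
    for b in idx:          # pure-Python loop over 100k elements
        y[b] += 1
    return y

def vector_fill():
    return np.bincount(idx, minlength=100)  # same counts, done in C

assert (loop_fill() == vector_fill()).all()
t_loop = timeit.timeit(loop_fill, number=3)
t_vec = timeit.timeit(vector_fill, number=3)
print(f"loop: {t_loop:.3f}s  vectorized: {t_vec:.3f}s")
```

Both functions produce identical counts; only the time spent differs.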
- I'm not exactly sure how your solution works. Can you make a reproducible example? – Amirhossein Rezaei, Oct 18, 2021
- Updated the post with the full code. – Leonid Mednikov, Oct 19, 2021
- This is very good, but it is still significantly slower than the original hist function's 0.6283 seconds. So the question becomes: what happens inside the hist function that makes it that fast? – Amirhossein Rezaei, Oct 19, 2021