Setup: I have ~30 parameters (dependent variables) measured simultaneously along a common time axis (independent variable) with ~1 Hz resolution. I need to calculate bin averages for all parameters, with a bin width of 10 seconds on the time axis; other statistics like standard deviation and median are potentially also desired. Binning should include the lower bound and exclude the upper one, with the binned time axis labelled by bin centers ending in 5 (e.g. all samples with 10 <= t < 20 map to bin center 15). I need to do this quite often (>3M entries total for one parameter set), so I'm looking for an efficient solution.
Context: The binned data I need is an intermediate product, i.e. the workflow is just read data --> binning --> write binned data. That makes me believe something 'high-level' like pandas might bring too much overhead. I've read some posts on the topic on Stack Overflow, e.g. here or here. I also know there is scipy.stats.binned_statistic (see here), but that would repeat the binning of the time axis for every single parameter, which seems inefficient imho.
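For reference, this is roughly what a per-parameter binned_statistic loop would look like (toy data invented for illustration); each call receives the same time axis, so the digitization work is repeated per parameter:
import numpy as np
from scipy import stats

t = np.arange(0, 60, 1.0)                # toy 1 Hz time axis
values = np.random.rand(30, t.size)      # stand-in for the ~30 parameters
edges = np.arange(0, 70, 10)             # 10 s bin edges, lower bound inclusive

# one call per parameter: the time axis is re-binned 30 times
means = [stats.binned_statistic(t, v, statistic='mean', bins=edges)[0]
         for v in values]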
What I've done so far:
import numpy as np

# example input time axis t:
# note #1: there can be 'gaps' when data recording is interrupted
# note #2: t is guaranteed to be strictly increasing
t = np.concatenate((np.linspace(13, 24, 12), np.linspace(56, 100, 45)))
# 'binned' time axis:
tmin, tmax = np.rint(t[0]), np.rint(t[-1])
t_binned = np.arange((tmin-tmin%10)+5, (tmax-tmax%10)+6, 10)
# array([ 15, 25, 35, 45, 55, 65, 75, 85, 95, 105])
# calculate bin starting indices and bins
binstarts = np.append(t_binned-5, t_binned[-1]+5)
# since t is strictly increasing, I use np.searchsorted, which is slightly faster than np.digitize
bins = np.searchsorted(binstarts, t, side='right')
# remove elements from the binned time axis that would not map to any values:
t_binned = t_binned[np.bincount(bins-1).astype(bool)]
# array([ 15, 25, 55, 65, 75, 85, 95, 105])
Now that I have the bins, I can apply them to any parameter v_i from my input data with parameters v_n:
v_i = np.random.rand(t.size)
v_i_binavg = []
for bin_no in np.unique(bins):
    v_i_binavg.append(np.nanmean(v_i[bins == bin_no]))
    # potentially also other calculations here, like stdev or median
This works in principle, but it feels a bit hacky, and I think efficiency could be improved, e.g. by getting rid of the for loop in the last step. I found that loop to be a bottleneck in this approach (which is basically what I do now as well). Any suggestions?
P.S. For regular 1D binning (avg or sum) I found this approach to be very efficient; however, it excludes the whole 'mapping to a time axis' problem.
1 Answer
Don't use linspace, among other reasons because it doesn't return a useful dtype by default. Just use arange, and then you can remove your calls to rint. (Also: rint doesn't return an integer dtype by default either!)
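A quick demonstration of the dtype behaviour (values picked arbitrarily):
import numpy as np

np.linspace(13, 24, 12).dtype   # float64, even for integer endpoints
np.rint(13.2).dtype             # float64: rint rounds, but stays floating-point
np.arange(13, 25).dtype         # platform integer (e.g. int64), usable directly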
Rather than np.bincount(bins-1).astype(bool), I have a minor preference for nonzero(), as the latter is more literally what you're doing. There might be a slight performance difference, but I have not checked.
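Both spellings select the same non-empty bins; a small sketch of the equivalence, with invented occupancy counts:
import numpy as np

counts = np.array([3, 0, 2, 0, 1])   # samples per bin; some bins are empty
mask = counts.astype(bool)           # boolean mask: [ True, False, True, False, True]
idx, = np.nonzero(counts)            # integer indices: [0, 2, 4]
assert (np.arange(counts.size)[mask] == idx).all()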
Replace the module-level call np.searchsorted with an ndarray method call.
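The two spellings are equivalent; for instance:
import numpy as np

bin_starts = np.arange(10, 111, 10)
t = np.array([13.0, 22.5, 57.0])
assert (np.searchsorted(bin_starts, t, side='right')
        == bin_starts.searchsorted(t, side='right')).all()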
For testing, it's very important that you set NumPy's random seed to a constant value.
Remove your for loop and use the secondary returns from unique() to compute your mean in a vectorised manner. This is the exact equivalent of your code, but it does not cover the "other calculations here, like stdev or median", as those take different expressions (they can be similarly vectorised with a weighted bincount; see the sketch after the full code below).
All together,
import numpy as np
# example input time axis t:
# note #1: there can be 'gaps' when data recording is interrupted
# note #2: t is guaranteed to be strictly increasing
t = np.concatenate((
    np.arange(13, 25),
    np.arange(56, 101),
))
# 'binned' time axis:
tmin, tmax = t[[0, -1]]
t_binned = np.arange(tmin - tmin%10 + 5, tmax - tmax%10 + 6, 10)
# array([ 15, 25, 35, 45, 55, 65, 75, 85, 95, 105])
# calculate bin starting indices and bins
bin_starts = np.append(t_binned - 5, t_binned[-1] + 5)
# since t is strictly increasing, use np.searchsorted, which is slightly faster than np.digitize
bins = bin_starts.searchsorted(t, side='right')
# remove elements from the binned time axis that would not map to any values
idx, = np.nonzero(np.bincount(bins - 1))
t_binned = t_binned[idx]
# array([ 15, 25, 55, 65, 75, 85, 95, 105])
rand = np.random.default_rng(seed=0)
v_i = rand.random(t.size)
_, inverse, counts = np.unique(bins, return_inverse=True, return_counts=True)
v_i_binavg = np.bincount(inverse, weights=v_i)/counts
assert np.allclose(
    (
        0.4709873075421746, 0.6053572061057739, 0.44907522973138420, 0.45965187939900654,
        0.6083595733115170, 0.5774665735891596, 0.42149207729932386, 0.45033936664928700,
    ),
    v_i_binavg, atol=0, rtol=1e-15,
)
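Along the same lines, the standard deviation can be vectorised with a second weighted bincount, while the median needs a sort-and-split; here is a sketch continuing the variables from the snippet above (note the remaining small loop runs over bins only, not over samples):
# per-bin standard deviation (population, ddof=0) via Var[x] = E[x^2] - E[x]^2;
# the clamp absorbs tiny negative values from floating-point rounding
v_i_binsq = np.bincount(inverse, weights=v_i**2)/counts
v_i_binstd = np.sqrt(np.maximum(v_i_binsq - v_i_binavg**2, 0))

# per-bin median: there is no bincount equivalent, but one stable argsort
# plus a split per bin avoids recomputing a boolean mask for every bin
order = np.argsort(inverse, kind='stable')
splits = np.split(v_i[order], np.cumsum(counts)[:-1])
v_i_binmed = np.array([np.median(s) for s in splits])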