I'm trying to create an efficient function for re-sampling time-series data.
Assumption: Both sets of time-series data have the same start and end time. (I do this in a separate step.)
Resample function (inefficient)
import numpy as np

def resample(desired_time_sequence, data_sequence):
    downsampling_indices = np.linspace(0, len(data_sequence) - 1, len(desired_time_sequence)).round().astype(int)
    downsampled_array = [data_sequence[ind] for ind in downsampling_indices]
    return downsampled_array
Speed testing
import timeit

def test_speed():
    resample([1, 2, 3], [.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6])

print(timeit.timeit(test_speed, number=100000))
# 1.5003695999998854
Interested to hear any suggestions.
1 Answer
The function takes around \41ドル\,\mu s\$ on average per run on my machine. About three quarters of that (around \32ドル\,\mu s\$) is spent in `downsampling_indices = np.linspace(...)`. Add another \1ドル.5\,\mu s\$ for `.round().astype(int)`, about \1ドル\,\mu s\$ for the actual sampling, plus some calling overhead, and you're there.
So if you need to use the function several times, it would be best to precompute or cache/memoize the sampling indices. If I understood your implementation correctly, the downsampling index computation is data independent: it only depends on the lengths of the two sequences, so that should actually be viable.
For example you could have
import functools
...

@functools.lru_cache()
def compute_downsampling_indices_cached(n_samples, data_sequence_len):
    """Compute n_samples downsampling indices for data sequences of a given length"""
    return np.linspace(0, data_sequence_len - 1, n_samples).round().astype(int)
and then do
def resample_cache(n_samples, data_sequence):
    downsampling_indices = compute_downsampling_indices_cached(n_samples, len(data_sequence))
    return [data_sequence[ind] for ind in downsampling_indices]
Note that I replaced `desired_time_sequence` by `n_samples`, which would then have to be set to `len(desired_time_sequence)`, since you don't care about the actual values in `desired_time_sequence`.
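Putting the cached variant together into a self-contained sketch (the functions are repeated here so the snippet runs on its own; the sample data mirrors the input from the question):

```python
import functools
import numpy as np

@functools.lru_cache()
def compute_downsampling_indices_cached(n_samples, data_sequence_len):
    """Compute n_samples downsampling indices for data sequences of a given length."""
    return np.linspace(0, data_sequence_len - 1, n_samples).round().astype(int)

def resample_cache(n_samples, data_sequence):
    # Index computation is served from the cache after the first call
    # for a given (n_samples, len(data_sequence)) pair.
    downsampling_indices = compute_downsampling_indices_cached(n_samples, len(data_sequence))
    return [data_sequence[ind] for ind in downsampling_indices]

data = [.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6]
print(resample_cache(3, data))  # -> [0.5, 3.5, 6]
```

Since the cache key is just the pair of lengths, repeated calls with equally sized inputs skip the `np.linspace` work entirely, which is where most of the per-call cost was.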
It might also be possible to benefit from NumPy's indexing and use `return np.array(data_sequence)[downsampling_indices]` for larger inputs. You will have to check that yourself.
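A fully vectorized variant might look like the sketch below (`resample_numpy` is a hypothetical name; note it returns an ndarray rather than a list, and it mainly pays off when `data_sequence` is already an ndarray, since `np.asarray` then avoids a copy):

```python
import numpy as np

def resample_numpy(n_samples, data_sequence):
    """Downsample via NumPy fancy indexing; returns an ndarray, not a list."""
    data = np.asarray(data_sequence)  # no copy if already an ndarray
    indices = np.linspace(0, len(data) - 1, n_samples).round().astype(int)
    return data[indices]

data = [.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6]
print(resample_numpy(3, data).tolist())  # -> [0.5, 3.5, 6.0]
```

Whether this beats the list comprehension depends on input size and type, so it is worth benchmarking both on representative data.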
On my machine `resample_cache(...)` takes \1ドル.7\,\mu s\$, which is a decent roughly 20x speed-up.