I have used pandas and sqlite to perform searches with multiple conditions on a dataframe, like this:
   name  age  height
0  john   18     178
1   jen   25     168
age > 20 & height < 170 & height > 150
I am wondering if numpy can do the same thing, and if it can, will it be faster than pandas and sqlite?
Thanks
Amir Charkhi
1 Answer 1
Yes, numpy can do the same thing, and it will likely be faster than pandas:
df = pd.DataFrame({'name': {0: 'john', 1: 'jen'},
                   'age': {0: 18, 1: 25},
                   'height': {0: 178, 1: 168}})
print((df['age'] > 20) & (df['height'] < 170) & (df['height'] > 150))
0 False
1 True
dtype: bool
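For completeness, that boolean Series can be used directly to select the matching rows (a standard pandas pattern, sketched here as an addition to the answer):

```python
import pandas as pd

df = pd.DataFrame({'name': {0: 'john', 1: 'jen'},
                   'age': {0: 18, 1: 25},
                   'height': {0: 178, 1: 168}})

# Combine the conditions into one boolean mask, then index with it
mask = (df['age'] > 20) & (df['height'] < 170) & (df['height'] > 150)
print(df[mask])  # only row 1 (jen) matches
```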
m = df.values.T # Note the transposition
print((m[1] > 20) & (m[2] < 170) & (m[2] > 150))
array([False, True])
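The same mask works on the NumPy side. A small sketch (variable names mirror the answer) showing how to recover the matching records from the transposed array:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': {0: 'john', 1: 'jen'},
                   'age': {0: 18, 1: 25},
                   'height': {0: 178, 1: 168}})

m = df.values.T  # shape (3, 2): one row of m per dataframe column
mask = (m[1] > 20) & (m[2] < 170) & (m[2] > 150)

# Apply the mask along the second axis: keeps the columns of m
# (i.e. the dataframe rows) where the mask is True
print(m[:, mask])
```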
Performance
>>> %timeit (df['age'] > 20) & (df['height'] < 170) & (df['height'] > 150)
392 μs ± 1.87 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit (m[1] > 20) & (m[2] < 170) & (m[2] > 150)
6.69 μs ± 12.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
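One caveat worth noting (my addition, not part of the answer): because the frame mixes strings and numbers, df.values produces an object-dtype array, on which elementwise comparisons are slower. Selecting only the numeric columns first yields a true numeric ndarray:

```python
import pandas as pd

df = pd.DataFrame({'name': {0: 'john', 1: 'jen'},
                   'age': {0: 18, 1: 25},
                   'height': {0: 178, 1: 168}})

print(df.values.dtype)  # object, because 'name' holds strings

# Restrict to the numeric columns before converting
m = df[['age', 'height']].to_numpy().T  # integer dtype, shape (2, 2)
print((m[0] > 20) & (m[1] < 170) & (m[1] > 150))
```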
answered Aug 6, 2021 at 22:19
Corralien
3 Comments
mike
thanks! I suppose 'df.values.T' is using numpy? I thought it would be something like 'numpy.' ....
mike
is numpy.where() similar to pandas.query()?
Corralien
df.values.T converts your dataframe to an ndarray and transposes the result. np.where and pd.query are not the same. Refer to the documentation.
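To illustrate the difference the comment points to (a sketch, not from the thread): df.query filters a DataFrame's rows with an expression string, while np.where either returns the indices where a condition holds or chooses between two values elementwise:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': {0: 'john', 1: 'jen'},
                   'age': {0: 18, 1: 25},
                   'height': {0: 178, 1: 168}})

# pd.query: filter rows with an expression string
print(df.query('age > 20 and height < 170 and height > 150'))

# np.where with one argument: indices where the condition is True
idx = np.where((df['age'].to_numpy() > 20) & (df['height'].to_numpy() < 170))[0]
print(idx)

# np.where with three arguments: elementwise choice between two values
print(np.where(df['age'] > 20, 'over 20', 'under 20'))
```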