I am currently switching from R to Python, so please be patient with me. Is the following a good way to count the number of rows that match given column names and values?
import pandas as pd
df = pd.DataFrame([["1", "2"], ["2", "4"], ["1", "4"]], columns=['A', 'B'])
cn1 = "A"
cn2 = "B"
cv1 = "1"
cv2 = "2"
no_rows = len(df[(df[cn1]==cv1) & (df[cn2]==cv2)].index)
print(no_rows)
3 Answers
First, it's a bad idea to input your numerics as strings in your dataframe. Use plain ints instead.
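To illustrate why string-typed numerics can bite, here is a small sketch. The names df_str and df_int are made up for this example, dtype=object is set explicitly so the result does not depend on pandas' default string handling, and astype(int) is just one possible conversion that assumes every value parses as an integer:

```python
import pandas as pd

# Same data as in the question, with the numbers stored as strings.
# dtype=object is explicit here (an assumption of this sketch).
df_str = pd.DataFrame(
    [["1", "2"], ["2", "4"], ["1", "4"]],
    columns=["A", "B"],
    dtype=object,
)

# Comparing a column of strings against an int silently matches nothing.
count_str_vs_int = int((df_str["A"] == 1).sum())

# Converting once up front makes numeric comparisons behave as expected.
df_int = df_str.astype(int)
count_int = int((df_int["A"] == 1).sum())

print(count_str_vs_int, count_int)  # 0 2
```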
Your code currently forms a predicate, performs a slice on the frame and then finds the size of the frame. This is more work than necessary: the predicate itself is a series of booleans, and running .sum() on it produces the number of matching values.
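As a quick sketch of that point (using the question's data with ints, and a hypothetical variable name mask for the predicate):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [2, 4], [1, 4]], columns=["A", "B"])

# The predicate is a boolean Series with one entry per row.
mask = (df["A"] == 1) & (df["B"] == 2)

# True counts as 1 and False as 0, so summing counts the matching rows
# without building the filtered frame first.
count_via_sum = int(mask.sum())

# The slice-then-measure approach from the question gives the same number.
count_via_slice = len(df[mask].index)

print(mask.tolist(), count_via_sum, count_via_slice)  # [True, False, False] 1 1
```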
On top of that, your current code is not general-purpose. A general-purpose implementation could look like:
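from typing import Any

import pandas as pd


def match_count(df: pd.DataFrame, **criteria: Any) -> int:
    """Count the rows where every named column equals the given value."""
    pairs = iter(criteria.items())
    # Seed the predicate with the first criterion (at least one is required).
    column, value = next(pairs)
    predicate = df[column] == value
    # AND in each remaining criterion.
    for column, value in pairs:
        predicate &= df[column] == value
    # Each True counts as 1, so the sum is the number of matching rows.
    return int(predicate.sum())


def test() -> None:
    df = pd.DataFrame(
        [[1, 2],
         [2, 4],
         [1, 4]],
        columns=['A', 'B'],
    )
    print(match_count(df, A=1, B=2))


if __name__ == '__main__':
    test()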
I usually use shape[0] because it's more readable, so in your case it would be:
no_rows = df[(df[cn1]==cv1) & (df[cn2]==cv2)].shape[0]
While this specific example can be completely refactored into Reinderien's top-notch functions, we don't always need something so elaborate (e.g., quick exploratory analysis).
Masking and counting come up very often in one form or another, so I think it's still worth reviewing how to do them idiomatically in pandas.
Revised code
Maintaining the spirit of the original code, I would use something like:
matches = df[cn1].eq(cv1) & df[cn2].eq(cv2)
len(df[matches]) # but remember that matches.sum() is faster
Comments on the original code
len(df[(df[cn1] == cv1) & (df[cn2] == cv2)].index)
 ^              ^       ^                   ^
 3              2       4                   1
1. No need to use .index explicitly since DataFrame.__len__ does it automatically:

   class DataFrame(NDFrame, OpsMixin):
       ...
       def __len__(self) -> int:
           return len(self.index)

2. DataFrame.eq can sometimes be useful over ==:

   - supports axis/level broadcasting
   - arguably more readable when joining multiple tests

     df[cn1].eq(cv1) & df[cn2].eq(cv2) # (df[cn1] == cv1) & (df[cn2] == cv2)

   - arguably more readable when chaining methods (e.g., when comparing shifted columns)

     df[cn1].shift().eq(cv1).cumsum() # (df[cn1].shift() == cv1).cumsum()

3. If speed is important, len(df) is faster than df.shape[0] (h/t @root).

4. If you have a lot of conditions to join (e.g., generated via comprehension), consider np.logical_and.reduce:

   df[np.logical_and.reduce([
       df[cn1] == cv1,
       # ...
       df[cn2] == cv2,
   ])]
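For the comprehension case, a minimal sketch might look like the following. The criteria dict and its values are hypothetical, chosen only to illustrate the pattern on the small frame from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [2, 4], [1, 4]], columns=["A", "B"])

# Hypothetical criteria mapping: column name -> required value.
criteria = {"A": 1, "B": 4}

# One boolean Series per condition, generated via comprehension.
conditions = [df[column] == value for column, value in criteria.items()]

# Element-wise AND across all conditions, giving one boolean per row.
mask = np.logical_and.reduce(conditions)

matching_rows = df[mask]  # the filtered frame, if the rows are needed
count = int(mask.sum())   # just the count, without slicing
print(count)  # 1
```

This assumes at least one condition; with an empty list, the reduction has nothing row-shaped to work with, so guard for that case if the criteria can be empty.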