1
\$\begingroup\$

I am currently switching for R to Python so please be patient with me. Is the following a good way to count the number of rows given column names and values?

import pandas as pd
df = pd.DataFrame([["1", "2"], ["2", "4"], ["1", "4"]], columns=['A', 'B'])
cn1 = "A"
cn2 = "B"
cv1 = "1"
cv2 = "2"
no_rows = len(df[(df[cn1]==cv1) & (df[cn2]==cv2)].index)
print(no_rows)
asked Dec 6, 2021 at 9:45
\$\endgroup\$

3 Answers 3

1
\$\begingroup\$

First, it's a bad idea to input your numerics as strings in your dataframe. Use plain ints instead.

Your code currently forms a predicate, performs a slice on the frame and then finds the size of the frame. This is more work than necessary - the predicate itself is a series of booleans, and running a .sum() on it produces the number of matching values.

That, plus your current code is not general-purpose. A general-purpose implementation could look like

from typing import Dict, Any
import pandas as pd
def match_count(df: pd.DataFrame, **criteria: Any) -> int:
 pairs = iter(criteria.items())
 column, value = next(pairs)
 predicate = df[column] == value
 for column, value in pairs:
 predicate &= df[column] == value
 return predicate.sum()
def test() -> None:
 df = pd.DataFrame(
 [[1, 2],
 [2, 4],
 [1, 4]],
 columns=['A', 'B'],
 )
 print(match_count(df, A=1, B=2))
if __name__ == '__main__':
 test()
answered Dec 6, 2021 at 17:34
\$\endgroup\$
1
\$\begingroup\$

I usually use shape[0] because it's more readable, so in your case it would be:

no_rows = df[(df[cn1]==cv1) & (df[cn2]==cv2)].shape[0]
answered Dec 6, 2021 at 16:35
\$\endgroup\$
1
\$\begingroup\$

While this specific example can be completely refactored into Reinderien's top-notch functions, we don't always need something so elaborate (e.g., quick exploratory analysis).

Masking and counting come up very often in one form or another, so I think it's still worth reviewing how to do them idiomatically in pandas.


Revised code

Maintaining the spirit of the original code, I would use something like:

matches = df[cn1].eq(cv1) & df[cn2].eq(cv2)
len(df[matches]) # but remember that matches.sum() is faster

Comments on the original code

len(df[(df[cn1] == cv1) & (df[cn2] == cv2)].index)
^ ^ ^ ^
3 2 4 1
  1. No need to use .index explicitly since DataFrame.__len__ does it automatically:

    class DataFrame(NDFrame, OpsMixin):
     ...
     def __len__(self) -> int:
     return len(self.index)
    
  2. DataFrame.eq can sometimes be useful over ==:

    • supports axis / level broadcasting

    • arguably more readable when joining multiple tests

      df[cn1].eq(cv1) & df[cn2].eq(cv2) # (df[cn1] == cv1) & (df[cn2] == cv2)
      
    • arguably more readable when chaining methods (e.g., when comparing shifted columns)

      df[cn1].shift().eq(cv1).cumsum() # (df[cn1].shift() == cv1).cumsum()
      
  3. If speed is important, len(df) is faster than df.shape[0] (h/t @root):

  4. If you have a lot of conditions to join (e.g., generated via comprehension), consider np.logical_and.reduce:

    df[np.logical_and.reduce([
     df[cn1] == cv1,
     # ...
     df[cn2] == cv2,
    ])]
    
answered Dec 8, 2021 at 6:38
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.