6
\$\begingroup\$

I've implemented Excel's SUMIFS function in Pandas using the following code. Is there a better, more Pythonic implementation?

from pandas import Series, DataFrame
import pandas as pd
df = pd.read_csv('data.csv')
# pandas equivalent of Excel's SUMIFS function
df.groupby('PROJECT').sum().ix['A001']

One concern I have with this implementation is that I'm not explicitly specifying the column to be summed.

Data File

Here's an example CSV data file (data.csv), although I'm displaying | instead of commas to improve the visual appearance.

DATE | EMPLOYEE | PROJECT | HOURS
02/01/14 | Smith, John | A001 | 4.0
02/01/14 | Smith, John | B002 | 4.0
02/01/14 | Doe, Jane | A001 | 3.0
02/01/14 | Doe, Jane | C003 | 5.0
02/02/14 | Smith, John | B002 | 2.0
02/02/14 | Smith, John | C003 | 6.0
02/02/14 | Doe, Jane | A001 | 8.0

Equivalent Excel SUMIFS Function

If I were to open data.csv in Excel and wanted to determine how many hours were worked on project A001, I would use the SUMIFS formula as follows:

=SUMIFS($D2:$D8, $C2:$C8, "A001")

Where the SUMIFS function syntax is:

=SUMIFS(sum_range, criteria_range1, criteria1, [criteria_range2, criteria2], ...)
200_success
asked Feb 24, 2014 at 19:40
\$\endgroup\$

2 Answers

5
\$\begingroup\$

The usual approach -- if you want all the projects -- would be

>>> df.groupby("PROJECT")["HOURS"].sum()
PROJECT
A001    15.0
B002     6.0
C003    11.0
Name: HOURS, dtype: float64

This only applies the sum on the desired column, as this constructs an intermediate SeriesGroupBy object:

>>> df.groupby("PROJECT")["HOURS"]
<pandas.core.groupby.SeriesGroupBy object at 0xa94f8cc>
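Since the grouped sum is a Series indexed by PROJECT, a single project's total can then be picked out by label. A minimal sketch, assuming the sample rows from the question are built inline rather than read from data.csv (the names totals and a001_hours are only illustrative):

```python
import pandas as pd

# Sample rows copied from the question's data.csv (only the columns needed here)
df = pd.DataFrame({
    "PROJECT": ["A001", "B002", "A001", "C003", "B002", "C003", "A001"],
    "HOURS": [4.0, 4.0, 3.0, 5.0, 2.0, 6.0, 8.0],
})

# Sum only the HOURS column within each project; the result is a Series
# whose index holds the project codes
totals = df.groupby("PROJECT")["HOURS"].sum()

# Label-based lookup of one project's total, using .loc instead of .ix
a001_hours = totals.loc["A001"]
print(a001_hours)
```

This keeps every project's total available in totals while still naming the summed column explicitly.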

If you're only interested in the total hours of a particular project, then I suppose you could do

>>> df.loc[df.PROJECT == "A001", "HOURS"].sum()
15.0

or if you dislike the repetition of df:

>>> df.query("PROJECT == 'A001'")["HOURS"].sum()
15.0

but I find that I almost always want to be able to access more than one sum, so these are pretty rare patterns in my code.
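SUMIFS also accepts several criteria pairs, and the boolean-mask form extends to that case by combining conditions with &. A rough sketch, assuming the question's sample rows are built inline and using a second criterion on EMPLOYEE that is chosen here purely for illustration:

```python
import pandas as pd

# Sample rows copied from the question's data.csv
df = pd.DataFrame({
    "EMPLOYEE": ["Smith, John", "Smith, John", "Doe, Jane", "Doe, Jane",
                 "Smith, John", "Smith, John", "Doe, Jane"],
    "PROJECT": ["A001", "B002", "A001", "C003", "B002", "C003", "A001"],
    "HOURS": [4.0, 4.0, 3.0, 5.0, 2.0, 6.0, 8.0],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than ==.
mask = (df["PROJECT"] == "A001") & (df["EMPLOYEE"] == "Doe, Jane")

# Sum HOURS over the rows that satisfy both criteria
jane_a001_hours = df.loc[mask, "HOURS"].sum()
print(jane_a001_hours)
```

This roughly corresponds to a SUMIFS call with two criteria_range/criteria pairs.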

Aside: .ix has fallen out of favour as it has some confusing behaviour (it was later deprecated and then removed in pandas 1.0). These days it's recommended to use .loc or .iloc to be explicit.

answered Feb 25, 2014 at 3:01
\$\endgroup\$
2
\$\begingroup\$

If you want to do simple sum aggregation together with SUMIF, or multiple SUMIFS with different criteria simultaneously, I would suggest the following approach:

(
 df
 .assign(HOURS_A001 = lambda df: df.apply(lambda x: x.HOURS if x.PROJECT == "A001" else 0, axis=1))
 .agg({'HOURS': 'sum', 'HOURS_A001': 'sum'})
)

or without per-row apply (this version is much faster):

import numpy as np

(
 df
 .assign(HOURS_A001 = lambda df: df.HOURS * np.where(df.PROJECT == "A001", 1, 0))
 .agg({'HOURS': 'sum', 'HOURS_A001': 'sum'})
)

So basically, apply the criteria to create a new column, then sum the values in that column.
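The same idea can be tried end to end with the sample data built inline. The sketch below swaps in Series.where for the np.where multiplication, which is an alternative and not the code above; the variable name result is only illustrative:

```python
import pandas as pd

# Sample rows copied from the question's data.csv
df = pd.DataFrame({
    "PROJECT": ["A001", "B002", "A001", "C003", "B002", "C003", "A001"],
    "HOURS": [4.0, 4.0, 3.0, 5.0, 2.0, 6.0, 8.0],
})

result = (
    df
    # Keep HOURS where the criterion holds and substitute 0.0 elsewhere
    .assign(HOURS_A001=lambda d: d["HOURS"].where(d["PROJECT"] == "A001", 0.0))
    # Sum the original column and the conditional column in one pass
    .agg({"HOURS": "sum", "HOURS_A001": "sum"})
)
print(result)
```

Here result is a Series holding both the overall total and the A001-only total, so further conditional columns can be added with more assign arguments.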

answered Dec 29, 2016 at 8:45
\$\endgroup\$
