\$\begingroup\$

I've implemented Excel's SUMIFS function in Pandas using the following code. Is there a better — more Pythonic — implementation?

from pandas import Series, DataFrame
import pandas as pd
df = pd.read_csv('data.csv')
# pandas equivalent of Excel's SUMIFS function
df.groupby('PROJECT').sum().ix['A001']

One concern I have with this implementation is that I'm not explicitly specifying the column to be summed.

Data File

Here's an example CSV data file (data.csv), although I'm displaying | instead of commas to improve the visual appearance.

DATE | EMPLOYEE | PROJECT | HOURS
02/01/14 | Smith, John | A001 | 4.0
02/01/14 | Smith, John | B002 | 4.0
02/01/14 | Doe, Jane | A001 | 3.0
02/01/14 | Doe, Jane | C003 | 5.0
02/02/14 | Smith, John | B002 | 2.0
02/02/14 | Smith, John | C003 | 6.0
02/02/14 | Doe, Jane | A001 | 8.0

Equivalent Excel SUMIFS Function

If I were to open data.csv in Excel and wanted to determine how many hours were worked on project A001, I would use the SUMIFS formula as follows:

=SUMIFS($D2:$D8, $C2:$C8, "A001")

Where the SUMIFS function syntax is:

=SUMIFS(sum_range, criteria_range1, criteria1, [criteria_range2, criteria2], ...)
200_success
asked Feb 24, 2014 at 19:40
\$\endgroup\$

2 Answers

\$\begingroup\$

The usual approach -- if you want all the projects -- would be

>>> df.groupby("PROJECT")["HOURS"].sum()
PROJECT
A001 15
B002 6
C003 11
Name: HOURS, dtype: float64

This applies the sum only to the desired column, because selecting ["HOURS"] on the groupby constructs an intermediate SeriesGroupBy object:

>>> df.groupby("PROJECT")["HOURS"]
<pandas.core.groupby.SeriesGroupBy object at 0xa94f8cc>
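To try the grouped sum without data.csv, the sample rows from the question can be built inline. This is only a minimal sketch: the inline DataFrame and the names hours_by_project and a001_hours are assumptions for illustration, not part of the original code.

```python
import pandas as pd

# Sample rows copied from the question's data.csv, built inline so the
# example runs without the file.
df = pd.DataFrame(
    {
        "DATE": ["02/01/14"] * 4 + ["02/02/14"] * 3,
        "EMPLOYEE": [
            "Smith, John", "Smith, John", "Doe, Jane", "Doe, Jane",
            "Smith, John", "Smith, John", "Doe, Jane",
        ],
        "PROJECT": ["A001", "B002", "A001", "C003", "B002", "C003", "A001"],
        "HOURS": [4.0, 4.0, 3.0, 5.0, 2.0, 6.0, 8.0],
    }
)

# Selecting the column before aggregating restricts the sum to HOURS.
hours_by_project = df.groupby("PROJECT")["HOURS"].sum()

# A single project's total is then a label lookup on the result.
a001_hours = hours_by_project.loc["A001"]

print(hours_by_project)
print(a001_hours)
```

Keeping the whole hours_by_project Series around is what makes it cheap to look up more than one project later.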

If you're only interested in the total hours of a particular project, then I suppose you could do

>>> df.loc[df.PROJECT == "A001", "HOURS"].sum()
15.0

or if you dislike the repetition of df:

>>> df.query("PROJECT == 'A001'")["HOURS"].sum()
15.0

but I find that I almost always want to be able to access more than one sum, so these are pretty rare patterns in my code.
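Since SUMIFS accepts several criteria pairs, the single-condition filter extends naturally by combining boolean masks with &. The sketch below assumes a second criterion on EMPLOYEE, which the question does not ask for; the names mask and jane_a001_hours are hypothetical.

```python
import pandas as pd

# Subset of the question's sample data, built inline for a runnable example.
df = pd.DataFrame(
    {
        "EMPLOYEE": [
            "Smith, John", "Smith, John", "Doe, Jane", "Doe, Jane",
            "Smith, John", "Smith, John", "Doe, Jane",
        ],
        "PROJECT": ["A001", "B002", "A001", "C003", "B002", "C003", "A001"],
        "HOURS": [4.0, 4.0, 3.0, 5.0, 2.0, 6.0, 8.0],
    }
)

# Each criterion is a boolean mask; & requires both to hold, like the
# additional criteria_range/criteria pairs in SUMIFS. The parentheses are
# needed because & binds more tightly than ==.
mask = (df["PROJECT"] == "A001") & (df["EMPLOYEE"] == "Doe, Jane")

# Sum only the HOURS values in the rows that satisfy every criterion.
jane_a001_hours = df.loc[mask, "HOURS"].sum()
print(jane_a001_hours)
```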

Aside: .ix has fallen out of favour because it mixes label-based and position-based indexing, which leads to confusing behaviour; it was later deprecated and then removed in pandas 1.0. Use .loc or .iloc to be explicit.

answered Feb 25, 2014 at 3:01
\$\endgroup\$
\$\begingroup\$

If you want to do simple sum aggregation together with SUMIF, or multiple SUMIFS with different criteria simultaneously, I would suggest the following approach:

(
 df
 .assign(HOURS_A001 = lambda df: df.apply(lambda x: x.HOURS if x.PROJECT == "A001" else 0, axis=1))
 .agg({'HOURS': 'sum', 'HOURS_A001': 'sum'})
)

or without per-row apply (this version is much faster):

import numpy as np

(
 df
 .assign(HOURS_A001 = lambda df: df.HOURS * np.where(df.PROJECT == "A001", 1, 0))
 .agg({'HOURS': 'sum', 'HOURS_A001': 'sum'})
)

So basically, apply the criterion to create a new column, then sum the values in that column.
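The same pattern can cover several criteria in one pass. The sketch below is a variation, not the answer's exact code: it uses Series.where instead of np.where to zero out non-matching rows, and the HOURS_B002 column and the summary name are hypothetical additions for illustration.

```python
import pandas as pd

# Subset of the question's sample data, built inline for a runnable example.
df = pd.DataFrame(
    {
        "PROJECT": ["A001", "B002", "A001", "C003", "B002", "C003", "A001"],
        "HOURS": [4.0, 4.0, 3.0, 5.0, 2.0, 6.0, 8.0],
    }
)

summary = (
    df
    .assign(
        # Keep HOURS where the criterion holds; replace it with 0 elsewhere.
        HOURS_A001=lambda d: d["HOURS"].where(d["PROJECT"] == "A001", 0.0),
        HOURS_B002=lambda d: d["HOURS"].where(d["PROJECT"] == "B002", 0.0),
    )
    # Sum the plain column and each conditional column together.
    [["HOURS", "HOURS_A001", "HOURS_B002"]]
    .sum()
)
print(summary)
```

The result is a Series indexed by column name, so the overall total and each conditional total sit side by side.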

answered Dec 29, 2016 at 8:45
\$\endgroup\$
