Excel's SUMIFS implemented using PANDAS, the Python Data Analysis Library

Question 1

I've implemented Excel's SUMIFS function in Pandas using the following code. Is there a better — more Pythonic — implementation?

from pandas import Series, DataFrame
import pandas as pd
df = pd.read_csv('data.csv')
# pandas equivalent of Excel's SUMIFS function
df.groupby('PROJECT').sum().ix['A001']

One concern I have with this implementation is that I'm not explicitly specifying the column to be summed.

Data File

Here's an example CSV data file (data.csv), although I'm displaying | instead of commas to improve the visual appearance.

DATE | EMPLOYEE | PROJECT | HOURS
02/01/14 | Smith, John | A001 | 4.0
02/01/14 | Smith, John | B002 | 4.0
02/01/14 | Doe, Jane | A001 | 3.0
02/01/14 | Doe, Jane | C003 | 5.0
02/02/14 | Smith, John | B002 | 2.0
02/02/14 | Smith, John | C003 | 6.0
02/02/14 | Doe, Jane | A001 | 8.0

Equivalent Excel SUMIFS Function

If I were to open data.csv in Excel and wanted to determine how many hours were worked on project A001, I would use the SUMIFS formula as follows:

=SUMIFS($D2:$D8, $C2:$C8, "A001")

Where the SUMIFS function syntax is:

=SUMIFS(sum_range, criteria_range1, criteria1, [criteria_range2,
 criteria2], ...)

Question 2

The usual approach -- if you want all the projects -- would be

>>> df.groupby("PROJECT")["HOURS"].sum()
PROJECT
A001 15
B002 6
C003 11
Name: HOURS, dtype: float64

This only applies the sum on the desired column, as this constructs an intermediate SeriesGroupBy object:

>>> df.groupby("PROJECT")["HOURS"]
<pandas.core.groupby.SeriesGroupBy object at 0xa94f8cc>

If you're only interested in the total hours of a particular project, then I suppose you could do

>>> df.loc[df.PROJECT == "A001", "HOURS"].sum()
15.0

or if you dislike the repetition of df:

>>> df.query("PROJECT == 'A001'")["HOURS"].sum()
15.0

but I find that I almost always want to be able to access more than one sum, so these are pretty rare patterns in my code.

Aside: .ix has fallen out of favour as it has some confusing behaviour. These days it's recommended to use .loc or .iloc to be explicit.

Question 3

If you want to do simple sum aggregation together with SUMIF, or multiple SUMIFS with different criteria simultaneously, I would suggest the following approach:

(
 df
 .assign(HOURS_A001 = lambda df: df.apply(lambda x: x.HOURS if x.PROJECT == "A001" else 0, axis=1))
 .agg({'HOURS': 'sum', 'HOURS_A001': 'sum'})
)

or without per-row apply (this version is much faster):

(
 df
 .assign(HOURS_A001 = lambda df: df.HOURS * np.where(df.PROJECT == "A001", 1, 0))
 .agg({'HOURS': 'sum', 'HOURS_A001': 'sum'})
)

So basically apply criteria and create a new row, then sum values in this row.

DSM 5013 silver badges9 bronze badges · Answer 1 · 2014-02-25 03:01:07Z

The usual approach -- if you want all the projects -- would be

>>> df.groupby("PROJECT")["HOURS"].sum()
PROJECT
A001 15
B002 6
C003 11
Name: HOURS, dtype: float64

This only applies the sum on the desired column, as this constructs an intermediate SeriesGroupBy object:

>>> df.groupby("PROJECT")["HOURS"]
<pandas.core.groupby.SeriesGroupBy object at 0xa94f8cc>

If you're only interested in the total hours of a particular project, then I suppose you could do

>>> df.loc[df.PROJECT == "A001", "HOURS"].sum()
15.0

or if you dislike the repetition of df:

>>> df.query("PROJECT == 'A001'")["HOURS"].sum()
15.0

but I find that I almost always want to be able to access more than one sum, so these are pretty rare patterns in my code.

Aside: .ix has fallen out of favour as it has some confusing behaviour. These days it's recommended to use .loc or .iloc to be explicit.

Dienow 1213 bronze badges · Answer 2 · 2016-12-29 08:45:18Z

If you want to do simple sum aggregation together with SUMIF, or multiple SUMIFS with different criteria simultaneously, I would suggest the following approach:

(
 df
 .assign(HOURS_A001 = lambda df: df.apply(lambda x: x.HOURS if x.PROJECT == "A001" else 0, axis=1))
 .agg({'HOURS': 'sum', 'HOURS_A001': 'sum'})
)

or without per-row apply (this version is much faster):

(
 df
 .assign(HOURS_A001 = lambda df: df.HOURS * np.where(df.PROJECT == "A001", 1, 0))
 .agg({'HOURS': 'sum', 'HOURS_A001': 'sum'})
)

So basically apply criteria and create a new row, then sum values in this row.

Stack Exchange Network

Excel's SUMIFS implemented using PANDAS, the Python Data Analysis Library

Data File

Equivalent Excel SUMIFS Function

2 Answers 2

You must log in to answer this question.

Hot Network Questions

Excel's SUMIFS implemented using PANDAS, the Python Data Analysis Library

Data File

Equivalent Excel SUMIFS Function

2 Answers 2

You must log in to answer this question.

Related

Hot Network Questions