4
\$\begingroup\$

I'm learning Pandas. I have a project where I need to munge some CSV files and resave them as CSV. I can use dictionaries and CSV module, but decided to use DataFrames to get more exposure and practice with Pandas.

The task is to sum() values for some keys while others are not summed. My solution was to add a column called "Line" where items with the same line number would be summed.

sales = [{"Line": 1, "KEY": "Apples", "AMOUNT": 3.99},
 {"Line": 1, "KEY": "Apples", "AMOUNT": 3.99},
 {"Line": 2, "KEY": "Oranges", "AMOUNT": 5.99},
 {"Line": 3, "KEY": "Pears", "AMOUNT": 2.99},
 {"Line": 3, "KEY": "Pears", "AMOUNT": 2.99},
 {"Line": 4, "KEY": "Carrots", "AMOUNT": .99},
 {"Line": 5, "KEY": "Carrots", "AMOUNT": .99},
 ]
df = pd.DataFrame(sales)

Now, Pandas groupby might seem like a perfect solution, but since I need to export the munged file as CSV, this seemed like something I could do only to have to undo.

# Find duplicate Line entries
# Subset df into just duplicate `Line` values
df_tmp = df[df.duplicated(subset="Line", keep=False)]
# Save a list of Line numbers to sum
line_dups = df_tmp['Line'].unique()
for x in line_dups:
 # Sum every line in the DF; one value sum is unchanged
 # asum = df.loc[df['Line'] == x, 'AMOUNT'].sum()
 # or
 # Subset the subset
 df_tmp2 = df_tmp[df_tmp["Line"] == x]
 # sum the sub-subset
 asum = df_tmp2['AMOUNT'].sum()
 # set the value of all keys with the same Line value
 df.loc[df['Line'] == x, 'AMOUNT'] = asum
 # take the inverse of the duplicate subset on the original df
 # Keep only the first duplicate line
 df2 = df[~df.duplicated(subset="Line", keep='first')]

The solution is effective, df2:

 AMOUNT KEY Line
0 7.98 Apples 1
2 5.99 Oranges 2
3 5.98 Pears 3
5 0.99 Carrots 4
6 0.99 Carrots 5

And yet, this solution is sort of hairy with all of the temp dataframes. Maybe a more seasoned Pandas user will have something to add which might help me better understand Pandas?

asked Apr 1, 2018 at 18:34
\$\endgroup\$
3
  • \$\begingroup\$ What do you mean that you need to undo the groupby to save your data to csv? \$\endgroup\$ Commented Jun 4, 2018 at 4:58
  • \$\begingroup\$ I see, that was a strange note--the problem is really a file conversion problem and can not produce new columns. From the groupby examples I've seen I expected changes to the index which I would need to undo. Maybe I was incorrect--if the same results could be obtained. \$\endgroup\$ Commented Jun 6, 2018 at 9:04
  • \$\begingroup\$ Is it important to maintain the index information (0, 2, 3, 5, 6)? Or do you really just want KEY and AMOUNT? I'm unclear about why exactly groupby doesn't work. \$\endgroup\$ Commented Jun 7, 2018 at 1:42

1 Answer 1

1
\$\begingroup\$

If you have the Line column in your CSV itself, you can replace what you have with:

df.groupby(['Line', 'KEY']).agg({'AMOUNT': 'sum'}).reset_index()
answered Apr 4, 2018 at 18:59
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.