Grouping a Pandas Dataframe by two parameters

Question 1

I have a .csv file of 8k+ rows which looks like this:

   state                      assembly                   candidate           party                               votes
0  Andaman & Nicobar Islands  Andaman & Nicobar Islands  BISHNU PADA RAY     Bharatiya Janata Party              90969
1  Andaman & Nicobar Islands  Andaman & Nicobar Islands  KULDEEP RAI SHARMA  Indian National Congress            83157
2  Andaman & Nicobar Islands  Andaman & Nicobar Islands  SANJAY MESHACK      Aam Aadmi Party                      3737
3  Andaman & Nicobar Islands  Andaman & Nicobar Islands  ANITA MONDAL        All India Trinamool Congress         2283
4  Andaman & Nicobar Islands  Andaman & Nicobar Islands  K.G.DAS             Communist Party of India (Marxist)   1777

The dataframe I want to end up with has one row per state and two columns: the total votes cast in that state, and the votes received there by one particular party ("Bharatiya Janata Party", in this case). Like this:

State                      Total Votes   BJP Votes
Andaman & Nicobar Islands       190328     90969.0
Andhra Pradesh                48358545   4091876.0
Arunachal Pradesh               596956    275344.0
Assam                         15085883   5507152.0
Bihar                         35885366  10543023.0

My code works but I'm pretty sure there's a much better way to get this done using fewer lines of code and without creating too many dataframes. Here's my code:

dff = df.groupby(['party'])[['votes']].agg('sum')
dff = dff.sort_values('votes')
BJP_df = df[df["party"]=="Bharatiya Janata Party"]
#print(BJP_df.head())
group = BJP_df.groupby(['state'])[['votes']].agg('sum')
state = df.groupby(['state'])[['votes']].agg('sum')
result = pd.concat([state, group], axis = 1, sort=False)
result.columns = ["Total Votes","BJP Votes"]

Any tips, suggestions, pointers would be very much appreciated.

Question 2

Here is one way using df.pivot_table():

Use np.where() to relabel every party other than Bharatiya Janata Party as Others, then build a pivot_table with states as rows and the two party labels as columns. Summing across axis=1 afterwards gives the total votes per state.

df1 = (
    df.assign(party=np.where(df.party.ne('Bharatiya Janata Party'), 'Others', df.party))
    .pivot_table(index='state', columns='party', values='votes', aggfunc='sum')
)

Another method with crosstab() similar to pivot_table:

df1 = pd.crosstab(
    df.state,
    np.where(df.party.ne('Bharatiya Janata Party'), 'Others', df.party),
    df.votes,
    aggfunc='sum',
)
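To try the crosstab() call above on its own, here is a minimal sketch. It assumes the five sample rows from the question plus two made-up Andhra Pradesh rows (the vote counts 100 and 85 are dummy values, and the names `party` and `label` are only for illustration).

```python
import numpy as np
import pandas as pd

# Five sample rows from the question, plus two made-up Andhra Pradesh rows
# (dummy values, only there so the table has a second state).
df = pd.DataFrame(
    {
        "state": ["Andaman & Nicobar Islands"] * 5 + ["Andhra Pradesh"] * 2,
        "party": [
            "Bharatiya Janata Party",
            "Indian National Congress",
            "Aam Aadmi Party",
            "All India Trinamool Congress",
            "Communist Party of India (Marxist)",
            "Bharatiya Janata Party",
            "Indian National Congress",
        ],
        "votes": [90969, 83157, 3737, 2283, 1777, 100, 85],
    }
)

party = "Bharatiya Janata Party"

# Every party except the one of interest is relabelled "Others".
label = np.where(df["party"].ne(party), "Others", df["party"])

# Rows are states, columns are the two labels, cells are summed votes.
df1 = pd.crosstab(df["state"], label, values=df["votes"], aggfunc="sum")
print(df1)
```

Passing `values=` together with `aggfunc=` is what turns crosstab() from a frequency count into a sum of votes.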

Finally, add the Total column and call reset_index():

df1 = df1.assign(Total=df1.sum(axis=1)).reset_index().rename_axis(None, axis=1)

Output (note: I added dummy Andhra Pradesh rows for testing):

   state                      Bharatiya Janata Party  Others   Total
0  Andaman & Nicobar Islands                   90969   90954  181923
1  Andhra Pradesh                                100      85     185

You can drop the Others column later if you don't need it: df1 = df1.drop(columns='Others')
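Putting the pivot_table steps above together, a self-contained sketch might look like the following. It assumes the five sample rows from the question plus two dummy Andhra Pradesh rows (100 and 85 votes, made up for testing); the variable names `party` and `label` are illustrative.

```python
import numpy as np
import pandas as pd

# Five sample rows from the question, plus two dummy Andhra Pradesh rows
# (made-up values used only for testing).
df = pd.DataFrame(
    {
        "state": ["Andaman & Nicobar Islands"] * 5 + ["Andhra Pradesh"] * 2,
        "party": [
            "Bharatiya Janata Party",
            "Indian National Congress",
            "Aam Aadmi Party",
            "All India Trinamool Congress",
            "Communist Party of India (Marxist)",
            "Bharatiya Janata Party",
            "Indian National Congress",
        ],
        "votes": [90969, 83157, 3737, 2283, 1777, 100, 85],
    }
)

party = "Bharatiya Janata Party"

# Collapse every other party into a single "Others" label.
label = np.where(df["party"].ne(party), "Others", df["party"])

# One row per state, one column per label, summed votes in the cells.
df1 = df.assign(party=label).pivot_table(
    index="state", columns="party", values="votes", aggfunc="sum"
)

# The row-wise sum is the state total; then turn the index back into
# a regular column and clear the leftover "party" columns name.
df1 = (
    df1.assign(Total=df1.sum(axis=1))
    .reset_index()
    .rename_axis(None, axis=1)
)
print(df1)
```

Note that a state with no rows for the chosen party would get NaN in that column; passing `fill_value=0` to pivot_table is one way to avoid that.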

Question 3

Almost thought this question was lost in the depths of Code Review. Thanks for the answer!

Question 4

@Abhishek My pleasure. :) I started contributing to this community today. :)

Question 5

Overall, your code was not too bad. You can group by two columns at once:

votes_per_state = df.groupby(["state", "party"])["votes"].sum().unstack(fill_value=0)

state                      Aam Aadmi Party  All India Trinamool Congress  Bharatiya Janata Party  Communist Party of India (Marxist)  Indian National Congress  other
Andaman & Nicobar Islands             3737                          2283                   90969                                1777                     83157      0
Andhra Pradesh                           0                             0                      85                                   0                         0    100

Then you can define which party you're interested in and manually assemble a DataFrame:

party_of_interest = "Bharatiya Janata Party"
result = pd.DataFrame(
 {
 party_of_interest: votes_per_state[party_of_interest],
 "total": votes_per_state.sum(axis=1),
 }
)

state                      Bharatiya Janata Party   total
Andaman & Nicobar Islands                   90969  181923
Andhra Pradesh                                 85     185

If you want you can even add a percentage:

result = pd.DataFrame(
 {
 party_of_interest: votes_per_state[party_of_interest],
 "total": votes_per_state.sum(axis=1),
 "pct": (
 votes_per_state[party_of_interest]
 / votes_per_state.sum(axis=1)
 * 100
 ).round(1),
 }
)

state                      Bharatiya Janata Party   total   pct
Andaman & Nicobar Islands                   90969  181923  50.0
Andhra Pradesh                                 85     185  45.9
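For anyone who wants to run the groupby/unstack approach above end to end, here is a minimal sketch. It assumes the five sample rows from the question plus two dummy Andhra Pradesh rows (85 votes for the party of interest and 100 for a made-up party called "other"); the helper variable `total` is only there to avoid computing the row sum twice.

```python
import pandas as pd

# Five sample rows from the question, plus two dummy Andhra Pradesh rows
# (made-up values; "other" is a placeholder party name).
df = pd.DataFrame(
    {
        "state": ["Andaman & Nicobar Islands"] * 5 + ["Andhra Pradesh"] * 2,
        "party": [
            "Bharatiya Janata Party",
            "Indian National Congress",
            "Aam Aadmi Party",
            "All India Trinamool Congress",
            "Communist Party of India (Marxist)",
            "Bharatiya Janata Party",
            "other",
        ],
        "votes": [90969, 83157, 3737, 2283, 1777, 85, 100],
    }
)

# Sum votes per (state, party), then pivot parties into columns.
# fill_value=0 covers parties that did not stand in a given state.
votes_per_state = (
    df.groupby(["state", "party"])["votes"].sum().unstack(fill_value=0)
)

party_of_interest = "Bharatiya Janata Party"
total = votes_per_state.sum(axis=1)  # total votes per state

result = pd.DataFrame(
    {
        party_of_interest: votes_per_state[party_of_interest],
        "total": total,
        "pct": (votes_per_state[party_of_interest] / total * 100).round(1),
    }
)
print(result)
```

Keeping the full state-by-party table around means switching to another party only requires changing `party_of_interest`.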

Question 6

I know that my code worked. I was just looking for something to improve efficiency as well as be more Pythonic. Seems like every project I work on ends up with me creating over 10-12 different dataframes. Don't know if that's just me. Thank you for your answer.

Accepted answer: the pivot_table answer above, by anky, posted 2019-06-24 09:40:49Z

