I have a .csv file of 8k+ rows which looks like this:
state assembly candidate \
0 Andaman & Nicobar Islands Andaman & Nicobar Islands BISHNU PADA RAY
1 Andaman & Nicobar Islands Andaman & Nicobar Islands KULDEEP RAI SHARMA
2 Andaman & Nicobar Islands Andaman & Nicobar Islands SANJAY MESHACK
3 Andaman & Nicobar Islands Andaman & Nicobar Islands ANITA MONDAL
4 Andaman & Nicobar Islands Andaman & Nicobar Islands K.G.DAS
party votes
0 Bharatiya Janata Party 90969
1 Indian National Congress 83157
2 Aam Aadmi Party 3737
3 All India Trinamool Congress 2283
4 Communist Party of India (Marxist) 1777
The final dataframe I want has one row per state and two columns: one with the votes received by a particular party ("Bharatiya Janata Party" in this case) in that row's state, and another with the total votes cast in that state. Like this:
State Total Votes BJP Votes
Andaman & Nicobar Islands 190328 90969.0
Andhra Pradesh 48358545 4091876.0
Arunachal Pradesh 596956 275344.0
Assam 15085883 5507152.0
Bihar 35885366 10543023.0
My code works but I'm pretty sure there's a much better way to get this done using fewer lines of code and without creating too many dataframes. Here's my code:
dff = df.groupby(['party'])[['votes']].agg('sum')
dff = dff.sort_values('votes')
BJP_df = df[df["party"]=="Bharatiya Janata Party"]
#print(BJP_df.head())
group = BJP_df.groupby(['state'])[['votes']].agg('sum')
state = df.groupby(['state'])[['votes']].agg('sum')
result = pd.concat([state, group], axis = 1, sort=False)
result.columns = ["Total Votes","BJP Votes"]
Any tips, suggestions, pointers would be very much appreciated.
2 Answers
Here is one way using df.pivot_table():
Replace every party other than Bharatiya Janata Party with Others using np.where(), then use pivot_table, and finally take sum() across axis=1 for the total votes.
df1=(df.assign(party=np.where(df.party.ne('Bharatiya Janata Party'),'Others',df.party)).
pivot_table(index='state',columns='party',values='votes',aggfunc='sum'))
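For anyone who wants to run the relabel-and-pivot step on its own, here is a minimal sketch. It assumes the CSV has already been loaded into a frame with `state`, `party` and `votes` columns; the five Andaman & Nicobar rows come from the question, while the two Andhra Pradesh rows are made-up test data.

```python
import numpy as np
import pandas as pd

# Five Andaman & Nicobar rows copied from the question; the two
# Andhra Pradesh rows are made-up test data, not real results.
df = pd.DataFrame(
    {
        "state": ["Andaman & Nicobar Islands"] * 5 + ["Andhra Pradesh"] * 2,
        "party": [
            "Bharatiya Janata Party",
            "Indian National Congress",
            "Aam Aadmi Party",
            "All India Trinamool Congress",
            "Communist Party of India (Marxist)",
            "Bharatiya Janata Party",
            "Indian National Congress",
        ],
        "votes": [90969, 83157, 3737, 2283, 1777, 100, 85],
    }
)

# Relabel every party except the one of interest as "Others".
label = np.where(df["party"].ne("Bharatiya Janata Party"), "Others", df["party"])

# One row per state, one column per label, votes summed in each cell.
df1 = df.assign(party=label).pivot_table(
    index="state", columns="party", values="votes", aggfunc="sum"
)
print(df1)
```

Because the relabelling leaves only two distinct values in `party`, the pivot always has at most two columns, no matter how many parties appear in the data.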
Another method with crosstab(), similar to pivot_table:
df1=pd.crosstab(df.state,np.where(df.party.ne('Bharatiya Janata Party'),'Others',df.party)
,df.votes,aggfunc='sum')
Finally, getting the Total and reset_index():
df1=df1.assign(Total=df1.sum(axis=1)).reset_index().rename_axis(None,axis=1)
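The crosstab() route can be sketched end to end in the same way, through to the Total column. The small frame below is an assumption for illustration (one real row set from the question plus two made-up Andhra Pradesh rows), and `label` is just a helper name introduced here.

```python
import numpy as np
import pandas as pd

# Andaman & Nicobar rows are from the question; Andhra Pradesh rows are dummy data.
df = pd.DataFrame(
    {
        "state": ["Andaman & Nicobar Islands"] * 5 + ["Andhra Pradesh"] * 2,
        "party": [
            "Bharatiya Janata Party",
            "Indian National Congress",
            "Aam Aadmi Party",
            "All India Trinamool Congress",
            "Communist Party of India (Marxist)",
            "Bharatiya Janata Party",
            "Indian National Congress",
        ],
        "votes": [90969, 83157, 3737, 2283, 1777, 100, 85],
    }
)

# Keep the party of interest, collapse everything else into "Others".
label = np.where(df["party"].eq("Bharatiya Janata Party"), df["party"], "Others")

# Cross-tabulate state against the label, summing votes in each cell.
df1 = pd.crosstab(df["state"], label, values=df["votes"], aggfunc="sum")

# Add the row total while "state" is still the index, then flatten the result.
df1 = df1.assign(Total=df1.sum(axis=1)).reset_index().rename_axis(None, axis=1)
print(df1)
```

The Total has to be computed before reset_index(); afterwards `state` would be an ordinary column and no longer excluded from the row-wise sum.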
Output: (Note: I had added dummy Andhra Pradesh rows for testing)
state Bharatiya Janata Party Others Total
0 Andaman & Nicobar Islands 90969 90954 181923
1 Andhra Pradesh 100 85 185
You can opt to drop the Others column later: df1=df1.drop(columns='Others')
- Almost thought this question was lost in the depths of Code Review. Thanks for the answer! (Rahul, Jun 24, 2019 at 11:55)
- @Abhishek My pleasure. :) I started contributing to this community today. :) (anky, Jun 24, 2019 at 12:08)
All in all, your code was not too bad. You can group by two columns at once:
votes_per_state = df.groupby(["state", "party"])["votes"].sum().unstack(fill_value=0)
state                      Aam Aadmi Party  All India Trinamool Congress  Bharatiya Janata Party  Communist Party of India (Marxist)  Indian National Congress  other
Andaman & Nicobar Islands             3737                          2283                   90969                                1777                     83157      0
Andhra Pradesh                           0                             0                      85                                   0                         0    100
Then you can define which party you're interested in, and manually assemble a DataFrame:
party_of_interest = "Bharatiya Janata Party"
result = pd.DataFrame(
    {
        party_of_interest: votes_per_state[party_of_interest],
        "total": votes_per_state.sum(axis=1),
    }
)
state                      Bharatiya Janata Party   total
Andaman & Nicobar Islands                   90969  181923
Andhra Pradesh                                 85     185
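As a runnable sketch of the two steps above, the snippet below rebuilds a small frame (the Andaman & Nicobar rows are from the question; the Andhra Pradesh rows, including the party called "other", are dummy data) and shows what fill_value=0 does: a party that did not stand in a state gets 0 votes there rather than NaN, so the columns stay integer.

```python
import pandas as pd

# Andaman & Nicobar rows are from the question; Andhra Pradesh rows are dummy data.
df = pd.DataFrame(
    {
        "state": ["Andaman & Nicobar Islands"] * 5 + ["Andhra Pradesh"] * 2,
        "party": [
            "Bharatiya Janata Party",
            "Indian National Congress",
            "Aam Aadmi Party",
            "All India Trinamool Congress",
            "Communist Party of India (Marxist)",
            "Bharatiya Janata Party",
            "other",
        ],
        "votes": [90969, 83157, 3737, 2283, 1777, 85, 100],
    }
)

# Sum votes per (state, party), then move the party level into columns.
# fill_value=0 fills the state/party combinations that have no rows.
votes_per_state = (
    df.groupby(["state", "party"])["votes"].sum().unstack(fill_value=0)
)

party_of_interest = "Bharatiya Janata Party"
result = pd.DataFrame(
    {
        party_of_interest: votes_per_state[party_of_interest],
        "total": votes_per_state.sum(axis=1),
    }
)
print(result)
```

Without fill_value=0 the missing combinations would be NaN and the affected columns would be upcast to float, which is probably also why the BJP Votes column in the question prints as 90969.0.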
If you want you can even add a percentage:
result = pd.DataFrame(
    {
        party_of_interest: votes_per_state[party_of_interest],
        "total": votes_per_state.sum(axis=1),
        "pct": (
            votes_per_state[party_of_interest]
            / votes_per_state.sum(axis=1)
            * 100
        ).round(1),
    }
)
state                      Bharatiya Janata Party   total   pct
Andaman & Nicobar Islands                   90969  181923  50.0
Andhra Pradesh                                 85     185  45.9
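If this is needed for more than one party, the steps above could be wrapped in a small function. This is only a sketch: `party_share` is a hypothetical name, not something from pandas, and the Andhra Pradesh rows are dummy data. It computes the per-state total once and reuses it for the percentage.

```python
import pandas as pd


def party_share(df: pd.DataFrame, party: str) -> pd.DataFrame:
    """Votes for `party`, total votes and percentage share, per state."""
    votes_per_state = (
        df.groupby(["state", "party"])["votes"].sum().unstack(fill_value=0)
    )
    total = votes_per_state.sum(axis=1)  # computed once, reused for "pct"
    return pd.DataFrame(
        {
            party: votes_per_state[party],
            "total": total,
            "pct": (votes_per_state[party] / total * 100).round(1),
        }
    )


# Andaman & Nicobar rows are from the question; Andhra Pradesh rows are dummy data.
df = pd.DataFrame(
    {
        "state": ["Andaman & Nicobar Islands"] * 5 + ["Andhra Pradesh"] * 2,
        "party": [
            "Bharatiya Janata Party",
            "Indian National Congress",
            "Aam Aadmi Party",
            "All India Trinamool Congress",
            "Communist Party of India (Marxist)",
            "Bharatiya Janata Party",
            "other",
        ],
        "votes": [90969, 83157, 3737, 2283, 1777, 85, 100],
    }
)

result = party_share(df, "Bharatiya Janata Party")
print(result)
```

Only `votes_per_state` and `result` are ever built, which keeps the number of intermediate dataframes down. Note that a party name that does not occur in the data raises a KeyError.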
- I know that my code worked. I was just looking for something to improve efficiency as well as be more Pythonic. Seems like every project I work on ends up with me creating over 10-12 different dataframes. Don't know if that's just me. Thank you for your answer. (Rahul, Jun 24, 2019 at 13:04)