Multiply two columns of Census data and groupby

Question 1

I have census data that looks like this:

State    County   TotalPop  Hispanic  White  Black  Native  Asian  Pacific
Alabama  Autauga  1948      0.9       87.4   7.7    0.3     0.6    0.0
Alabama  Autauga  2156      0.8       40.4   53.3   0.0     2.3    0.0
Alabama  Autauga  2968      0.0       74.5   18.6   0.5     1.4    0.3
...

Two things to note: (1) there can be multiple rows per county, and (2) the racial data is given as percentages, but sometimes I want the actual population counts.

Getting the total population for one race translates to (in pseudo-pandas):

(census.TotalPop * census.Hispanic / 100).groupby("State").sum()

But this raises KeyError: 'State', because the product of TotalPop and Hispanic is a pandas Series, not the original DataFrame, so it has no State column to group on.
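A minimal reproduction of the failure, using a made-up three-row frame rather than the real file (the numbers are hypothetical and only there to make the snippet self-contained):

```python
import pandas as pd

# Hypothetical stand-in for the census table; values are invented.
census = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alaska"],
    "TotalPop": [1000, 2000, 500],
    "Hispanic": [10.0, 5.0, 20.0],
})

# Multiplying two columns yields a Series: the State column is gone.
product = census.TotalPop * census.Hispanic / 100
print(type(product).__name__)

# Grouping that Series by a column *name* fails, since it has no columns.
error = None
try:
    product.groupby("State").sum()
except KeyError as err:
    error = err
    print("KeyError:", err)
```

The Series keeps the row index of census, but nothing in it is labelled "State", so the lookup by name has nothing to find.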

As suggested by this Stack Overflow question, I can create a new column for each race...

census["HispanicPop"] = census.TotalPop * census.Hispanic / 100

This works, but it feels messy: it adds six columns unnecessarily, as I only need the data for one plot. Here is the data (I'm using "acs2015_census_tract_data.csv") and here is my implementation:

Working Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
census = pd.read_csv("data/acs2015_census_tract_data.csv")
races = ['Hispanic', 'White', 'Black', 'Native', 'Asian', 'Pacific']
# Creating a total population column for each race
# FIXME: this feels inefficient. Does Pandas have another option?
for race in races:
    census[race + "_pop"] = (census[race] * census.TotalPop) / 100
# current racial population being plotted
race = races[0]
# Sum the populations in each state
race_pops = census.groupby("State")[race + "_pop"].sum().sort_values(ascending=False)
#### Plotting the results for each state
fig, axarr = plt.subplots(2, 2, figsize=(18, 12))
fig.suptitle("{} population in all 52 states".format(race), fontsize=18)
# Splitting the plot into 4 subplots so I can fit all 52 States
data = race_pops.head(13)
sns.barplot(x=data.values, y=data.index, ax=axarr[0][0])
data = race_pops.iloc[13:26]
sns.barplot(x=data.values, y=data.index, ax=axarr[0][1]).set(ylabel="")
data = race_pops.iloc[26:39]
sns.barplot(x=data.values, y=data.index, ax=axarr[1][0])
data = race_pops.tail(13)
_ = sns.barplot(x=data.values, y=data.index, ax=axarr[1][1]).set(ylabel="")

Comment 1

Hmm... are you going to keep it here? It appears you posted working code...

Comment 2

Yeah, I think this question belongs here. What do you think? That my code works wasn't so obvious the first time. Sorry for the confusion.

Comment 3

@Graipher In its current state, the code seems to work and he's asking for a better approach. This seems sufficiently on-topic to me.

Comment 4

Hi Sam, the variable census is the data I'm looking at and can be found here. I'll edit my question to include this too.

Comment 5

@scnerd I agree, the current question is on-topic.

Answer 1

Since you only need the total population values for these plots, it is not worth adding these columns to your census DataFrame. I would package the plotting into a function that builds a temporary DataFrame, uses it, and lets it be discarded once the plot is complete.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline

def plot_populations(census, race):
    # Group the data
    race_pops = pd.DataFrame(data={
        'State': census['State'],
        'Pop': census[race] * census['TotalPop'] / 100
    }).groupby('State')['Pop'].sum().sort_values(ascending=False)
    # Plot the results
    fig, axarr = plt.subplots(2, 2, figsize=(18, 12))
    fig.suptitle("{} population in all 52 states".format(race), fontsize=18)
    # Split the sorted states into four equal slices, one per subplot
    for ix, ax in enumerate(axarr.reshape(-1)):
        data = race_pops.iloc[ix*len(race_pops)//4:(ix+1)*len(race_pops)//4]
        sns.barplot(x=data.values, y=data.index, ax=ax)
        if ix % 2 != 0:
            ax.set_ylabel('')

census = pd.read_csv("acs2015_census_tract_data.csv")
races = ['Hispanic', 'White', 'Black', 'Native', 'Asian', 'Pacific']
# current racial population being plotted
race = races[0]
plot_populations(census, race)
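A possible variation on the same idea, offered as a sketch rather than as part of the code above: groupby also accepts a Series of keys aligned on the index, so the percentages for all races can be converted and summed in one go without building a temporary DataFrame by hand. The small frame below is made up for illustration; with the real file, census and races would be the objects already defined.

```python
import pandas as pd

# Hypothetical miniature of the census table; values are invented.
census = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alaska"],
    "TotalPop": [1000, 2000, 500],
    "Hispanic": [10.0, 5.0, 20.0],
    "White": [50.0, 80.0, 40.0],
})
races = ["Hispanic", "White"]

# Percentages -> head counts, row by row; `census` itself is not modified.
counts = census[races].mul(census["TotalPop"], axis=0).div(100)

# Group by the State *Series* (aligned on the row index), not a column name.
race_pops = counts.groupby(census["State"]).sum()
print(race_pops)
```

Selecting one column, for example race_pops["Hispanic"].sort_values(ascending=False), gives a Series of the same shape that the plotting loop expects.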

JahKnows · Answer 1 · 2018-05-17 08:56:30Z
