I have census data that looks like this:
State County TotalPop Hispanic White Black Native Asian Pacific
Alabama Autauga 1948 0.9 87.4 7.7 0.3 0.6 0.0
Alabama Autauga 2156 0.8 40.4 53.3 0.0 2.3 0.0
Alabama Autauga 2968 0.0 74.5 18.6 0.5 1.4 0.3
...
Two things to note: (1) there can be multiple rows for a county, and (2) the racial data is given as percentages, but sometimes I want the actual size of the population.
Getting the total population for a race translates to (in pseudo-pandas):
(census.TotalPop * census.Hispanic / 100).groupby("County").sum()
But this gives an error, KeyError: 'State', because the product of TotalPop and Hispanic is a pandas Series, not the original DataFrame, so it has no column of that name to group by.
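To make the failure concrete, here is a minimal reproduction sketch. The three-row frame is made up from the sample rows above (it is not the real file), and I group by "State" here to match the error message:

```python
import pandas as pd

# Made-up three-row frame; values copied from the sample rows above
census = pd.DataFrame({
    "State": ["Alabama"] * 3,
    "County": ["Autauga"] * 3,
    "TotalPop": [1948, 2156, 2968],
    "Hispanic": [0.9, 0.8, 0.0],
})

# Multiplying two columns yields a Series; the State and County columns are gone
product = census.TotalPop * census.Hispanic / 100
print(type(product).__name__)

error = None
try:
    # A string key is looked up on the Series itself, which has no such label
    product.groupby("State").sum()
except KeyError as err:
    error = err
    print("KeyError:", err)
```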
As suggested by this Stack Overflow question, I can create a new column for each race...
census["HispanicPop"] = census.TotalPop * census.Hispanic / 100
This works, but it feels messy: it adds 6 columns unnecessarily, since I only need the data for one plot. Here is the data (I'm using "acs2015_census_tract_data.csv") and here is my implementation:
Working Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
census = pd.read_csv("data/acs2015_census_tract_data.csv")
races = ['Hispanic', 'White', 'Black', 'Native', 'Asian', 'Pacific']
# Creating a total population column for each race
# FIXME: this feels inefficient. Does Pandas have another option?
for race in races:
    census[race + "_pop"] = (census[race] * census.TotalPop) / 100
# current racial population being plotted
race = races[0]
# Sum the populations in each state
race_pops = census.groupby("State")[race + "_pop"].sum().sort_values(ascending=False)
#### Plotting the results for each state
fig, axarr = plt.subplots(2, 2, figsize=(18, 12))
fig.suptitle("{} population in all 52 states".format(race), fontsize=18)
# Splitting the plot into 4 subplots so I can fit all 52 States
data = race_pops.head(13)
sns.barplot(x=data.values, y=data.index, ax=axarr[0][0])
data = race_pops.iloc[13:26]
sns.barplot(x=data.values, y=data.index, ax=axarr[0][1]).set(ylabel="")
data = race_pops.iloc[26:39]
sns.barplot(x=data.values, y=data.index, ax=axarr[1][0])
data = race_pops.tail(13)
_ = sns.barplot(x=data.values, y=data.index, ax=axarr[1][1]).set(ylabel="")
- Hmm... are you going to keep it here? It appears you posted working code... – Sᴀᴍ Onᴇᴌᴀ, May 16, 2018 at 16:45
- Yeah, I think this question belongs here, what do you think? My working code wasn't so obvious the first time. Sorry for the confusion. – Patrick Stetz, May 16, 2018 at 16:48
- @Graipher In its current state, the code seems to work and he's asking for a better approach. This seems sufficiently on-topic to me. – scnerd, May 16, 2018 at 17:07
- Hi Sam, the variable census is the data I'm looking at and can be found here. I'll edit my question to include this too. – Patrick Stetz, May 16, 2018 at 17:22
- @scnerd I agree, the current question is on-topic. – Graipher, May 16, 2018 at 18:29
1 Answer
Since you only want the total population values for these plots, it is not worth adding these columns to your census DataFrame. I would package the plotting into a function that builds a temporary DataFrame, which is used and then discarded once the plotting is complete.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
def plot_populations(census, race):
    # Group the data
    race_pops = pd.DataFrame(data={
        'State': census['State'],
        'Pop': census[race] * census['TotalPop'] / 100
    }).groupby('State')['Pop'].sum().sort_values(ascending=False)
    # Plot the results
    fig, axarr = plt.subplots(2, 2, figsize=(18, 12))
    fig.suptitle("{} population in all 52 states".format(race), fontsize=18)
    # Split the sorted states into four equal slices, one per subplot
    for ix, ax in enumerate(axarr.reshape(-1)):
        data = race_pops.iloc[ix*len(race_pops)//4:(ix+1)*len(race_pops)//4]
        sns.barplot(x=data.values, y=data.index, ax=ax)
        if ix % 2 != 0: ax.set_ylabel('')
census = pd.read_csv("acs2015_census_tract_data.csv")
races = ['Hispanic', 'White', 'Black', 'Native', 'Asian', 'Pacific']
# current racial population being plotted
race = races[0]
plot_populations(census, race)
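As a possible alternative to the temporary DataFrame, a Series can also be grouped by another Series that shares its index, so the State column can be passed to groupby directly. The sketch below is illustrative only: the three-row frame and its values are made up (not taken from the real file), and the names hispanic_pop and all_pops are hypothetical.

```python
import pandas as pd

# Tiny made-up frame in the same shape as the census data (illustrative values)
census = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alaska"],
    "TotalPop": [1000, 2000, 500],
    "Hispanic": [10.0, 5.0, 20.0],
    "White": [50.0, 45.0, 60.0],
})
races = ["Hispanic", "White"]

# One race: group the computed Series by the State column itself,
# rather than by a column name the Series does not have
hispanic_pop = (
    (census["TotalPop"] * census["Hispanic"] / 100)
    .groupby(census["State"])
    .sum()
)

# All races at once: scale each percentage column by TotalPop row-wise,
# then group the resulting frame by the State column
all_pops = (
    census[races]
    .mul(census["TotalPop"], axis=0)
    .div(100)
    .groupby(census["State"])
    .sum()
)

print(hispanic_pop)
print(all_pops)
```

Neither expression adds columns to census, so the same idea could replace the pd.DataFrame(...) construction inside plot_populations if you prefer it.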