\$\begingroup\$

I have census data that looks like this

 State County TotalPop Hispanic White Black Native Asian Pacific
 Alabama Autauga 1948 0.9 87.4 7.7 0.3 0.6 0.0
 Alabama Autauga 2156 0.8 40.4 53.3 0.0 2.3 0.0
 Alabama Autauga 2968 0.0 74.5 18.6 0.5 1.4 0.3
 ...

Two things to note: (1) there can be multiple rows for a County, and (2) the racial data is given in percentages, but sometimes I want the actual population count.

Getting the total racial population translates to (in pseudo Pandas):

(census.TotalPop * census.Hispanic / 100).groupby("State").sum()

But this raises an error, KeyError: 'State', because the product of TotalPop and Hispanic is a pandas Series, not the original DataFrame, so it has no columns to group by.
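To make the failure concrete, here is a minimal sketch on made-up data (the toy rows and the names `hispanic_pop`, `raised`, and `by_state` are illustrative assumptions, not part of the census file). It shows that grouping the product by a column name fails, while passing the `State` column itself as the grouping key works, since pandas aligns the two Series on their shared index.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real census file
census = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alaska"],
    "County": ["Autauga", "Autauga", "Anchorage"],
    "TotalPop": [1000, 2000, 500],
    "Hispanic": [10.0, 5.0, 20.0],
})

# The product is a plain Series: it carries the row index but no "State" column
hispanic_pop = census["TotalPop"] * census["Hispanic"] / 100

# Grouping by a label fails, because the Series has nothing named "State"
try:
    hispanic_pop.groupby("State").sum()
    raised = False
except KeyError:
    raised = True

# Grouping by the State column itself works: the key is aligned on the index
by_state = hispanic_pop.groupby(census["State"]).sum()
print(by_state)
```

This avoids adding a column to `census`, at the cost of having to pass the key Series explicitly.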

As suggested by this Stack Overflow question, I can create a new column for each race...

census["HispanicPop"] = census.TotalPop * census.Hispanic / 100

This works but feels messy: it adds 6 columns unnecessarily, as I only need the data for one plot. Here is the data (I'm using "acs2015_census_tract_data.csv") and here is my implementation:

Working Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
census = pd.read_csv("data/acs2015_census_tract_data.csv")
races = ['Hispanic', 'White', 'Black', 'Native', 'Asian', 'Pacific']
# Creating a total population column for each race
# FIXME: this feels inefficient. Does Pandas have another option?
for race in races:
    census[race + "_pop"] = (census[race] * census.TotalPop) / 100
# current racial population being plotted
race = races[0]
# Sum the populations in each state
race_pops = census.groupby("State")[race + "_pop"].sum().sort_values(ascending=False)
#### Plotting the results for each state
fig, axarr = plt.subplots(2, 2, figsize=(18, 12))
fig.suptitle("{} population in all 52 states".format(race), fontsize=18)
# Splitting the plot into 4 subplots so I can fit all 52 States
data = race_pops.head(13)
sns.barplot(x=data.values, y=data.index, ax=axarr[0][0])
data = race_pops.iloc[13:26]
sns.barplot(x=data.values, y=data.index, ax=axarr[0][1]).set(ylabel="")
data = race_pops.iloc[26:39]
sns.barplot(x=data.values, y=data.index, ax=axarr[1][0])
data = race_pops.tail(13)
_ = sns.barplot(x=data.values, y=data.index, ax=axarr[1][1]).set(ylabel="")
asked May 16, 2018 at 4:51
\$\endgroup\$
  • \$\begingroup\$ Hmm... are you going to keep it here? it appears you posted working code... \$\endgroup\$ Commented May 16, 2018 at 16:45
  • \$\begingroup\$ Yeah I think this question belongs here, what do you think? My working code wasn't so obvious the first time. Sorry for the confusion \$\endgroup\$ Commented May 16, 2018 at 16:48
  • \$\begingroup\$ @Graipher In its current state, the code seems to work and he's asking for a better approach. This seems sufficiently on-topic to me. \$\endgroup\$ Commented May 16, 2018 at 17:07
  • \$\begingroup\$ Hi Sam, the variable census is the data I'm looking at and can be found here. I'll edit my question to include this too \$\endgroup\$ Commented May 16, 2018 at 17:22
  • \$\begingroup\$ @scnerd I agree, the current question is on-topic. \$\endgroup\$ Commented May 16, 2018 at 18:29

1 Answer

\$\begingroup\$

Since you only need the total population values for these plots, it is not worth adding these columns to your census DataFrame. I would package the plotting into a function that builds a temporary DataFrame, uses it, and lets it be discarded once the plot is complete.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
def plot_populations(census, race):
    # Group the data
    race_pops = pd.DataFrame(data={
        'State': census['State'],
        'Pop': census[race] * census['TotalPop'] / 100
    }).groupby('State')['Pop'].sum().sort_values(ascending=False)
    # Plot the results
    fig, axarr = plt.subplots(2, 2, figsize=(18, 12))
    fig.suptitle("{} population in all 52 states".format(race), fontsize=18)
    # Split the sorted states into four consecutive chunks, one per subplot
    for ix, ax in enumerate(axarr.reshape(-1)):
        data = race_pops.iloc[ix*len(race_pops)//4:(ix+1)*len(race_pops)//4]
        sns.barplot(x=data.values, y=data.index, ax=ax)
        if ix % 2 != 0:
            ax.set_ylabel('')
census = pd.read_csv("acs2015_census_tract_data.csv")
races = ['Hispanic', 'White', 'Black', 'Native', 'Asian', 'Pacific']
# current racial population being plotted
race = races[0]
plot_populations(census, race)
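As a possible variation on the same idea, the per-state totals for every race can be computed in one pass without touching `census`. This is only a sketch: the helper name `race_populations_by_state` and the toy rows are assumptions for illustration. It relies on `DataFrame.mul(..., axis=0)` to scale each percentage column by `TotalPop` row by row, and on `groupby` accepting the `State` column as a key.

```python
import pandas as pd


def race_populations_by_state(census, races):
    """Return summed head counts: one row per state, one column per race.

    Hypothetical helper; it leaves the input DataFrame unmodified.
    """
    # Scale each percentage column by the row's total population
    counts = census[races].mul(census["TotalPop"], axis=0).div(100)
    # Group by the State column, which shares the same row index as counts
    return counts.groupby(census["State"]).sum()


# Made-up rows standing in for the real census file
toy = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alaska"],
    "TotalPop": [1000, 2000, 500],
    "Hispanic": [10.0, 5.0, 20.0],
    "White": [50.0, 25.0, 40.0],
})

totals = race_populations_by_state(toy, ["Hispanic", "White"])
print(totals)
```

A single column of the result, for example `totals["Hispanic"].sort_values(ascending=False)`, could then be plotted in place of `race_pops`.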
answered May 17, 2018 at 8:56
\$\endgroup\$
