This notebook was prepared by Donne Martin. Source and license info is on GitHub.
Pandas¶
Credits: The following are notes taken while working through Python for Data Analysis by Wes McKinney
- Series
- DataFrame
- Reindexing
- Dropping Entries
- Indexing, Selecting, Filtering
- Arithmetic and Data Alignment
- Function Application and Mapping
- Sorting and Ranking
- Axis Indexes with Duplicate Values
- Summarizing and Computing Descriptive Statistics
- Cleaning Data (Under Construction)
- Input and Output (Under Construction)
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
Series¶
A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels. The data can be any NumPy data type and the labels are the Series' index.
Create a Series:
ser_1 = Series([1, 1, 2, -3, -5, 8, 13])
ser_1

0     1
1     1
2     2
3    -3
4    -5
5     8
6    13
dtype: int64
Get the array representation of a Series:
ser_1.values
array([ 1, 1, 2, -3, -5, 8, 13])
Index objects are immutable and hold the axis labels and metadata such as names and axis names.
Get the index of the Series:
ser_1.index
Int64Index([0, 1, 2, 3, 4, 5, 6], dtype='int64')
Create a Series with a custom index:
ser_2 = Series([1, 1, 2, -3, -5], index=['a', 'b', 'c', 'd', 'e'])
ser_2

a    1
b    1
c    2
d   -3
e   -5
dtype: int64
Get a value from a Series:
ser_2[4] == ser_2['e']
True
Get a set of values from a Series by passing in a list:
ser_2[['c', 'a', 'b']]
c    2
a    1
b    1
dtype: int64
Get values greater than 0:
ser_2[ser_2 > 0]
a    1
b    1
c    2
dtype: int64
Scalar multiply:
ser_2 * 2
a     2
b     2
c     4
d    -6
e   -10
dtype: int64
Apply a numpy math function:
import numpy as np
np.exp(ser_2)

a    2.718282
b    2.718282
c    7.389056
d    0.049787
e    0.006738
dtype: float64
A Series is like a fixed-length, ordered dict.
Create a series by passing in a dict:
dict_1 = {'foo' : 100, 'bar' : 200, 'baz' : 300}
ser_3 = Series(dict_1)
ser_3

bar    200
baz    300
foo    100
dtype: int64
Re-order a Series by passing in an index (labels not found in the data are set to NaN):
index = ['foo', 'bar', 'baz', 'qux']
ser_4 = Series(dict_1, index=index)
ser_4

foo    100
bar    200
baz    300
qux    NaN
dtype: float64
Check for NaN with the pandas method:
pd.isnull(ser_4)
foo    False
bar    False
baz    False
qux     True
dtype: bool
Check for NaN with the Series method:
ser_4.isnull()
foo    False
bar    False
baz    False
qux     True
dtype: bool
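Entries flagged as missing can also be removed outright with `dropna`; a minimal sketch, rebuilding ser_4's data locally so the snippet is self-contained:

```python
import pandas as pd

# Same data as ser_4: 'qux' is not in the dict, so it becomes NaN
ser = pd.Series({'foo': 100, 'bar': 200, 'baz': 300},
                index=['foo', 'bar', 'baz', 'qux'])
cleaned = ser.dropna()  # returns a new Series without the NaN entries
```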
Series automatically aligns differently indexed data in arithmetic operations:
ser_3 + ser_4
bar    400
baz    600
foo    200
qux    NaN
dtype: float64
Name a Series:
ser_4.name = 'foobarbazqux'
Name a Series index:
ser_4.index.name = 'label'
ser_4
label
foo    100
bar    200
baz    300
qux    NaN
Name: foobarbazqux, dtype: float64
Rename a Series' index in place:
ser_4.index = ['fo', 'br', 'bz', 'qx']
ser_4

fo    100
br    200
bz    300
qx    NaN
Name: foobarbazqux, dtype: float64
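If modifying in place is not desired, `Series.rename` with a dict-like mapping returns a relabeled copy instead; a sketch (the mapping below is illustrative):

```python
import pandas as pd

ser = pd.Series([100, 200, 300], index=['foo', 'bar', 'baz'])
renamed = ser.rename({'foo': 'fo', 'bar': 'br', 'baz': 'bz'})
# ser keeps its original labels; renamed carries the new ones
```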
DataFrame¶
A DataFrame is a tabular data structure containing an ordered collection of columns. Each column can have a different type. A DataFrame has both row and column indices and is analogous to a dict of Series. Row and column operations are treated roughly symmetrically. Columns returned when indexing a DataFrame are views of the underlying data, not a copy. To obtain a copy, use the Series' copy method.
Create a DataFrame:
data_1 = {'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
          'year' : [2012, 2013, 2014, 2014, 2015],
          'pop' : [5.0, 5.1, 5.2, 4.0, 4.1]}
df_1 = DataFrame(data_1)
df_1
| pop | state | year | |
|---|---|---|---|
| 0 | 5.0 | VA | 2012 |
| 1 | 5.1 | VA | 2013 |
| 2 | 5.2 | VA | 2014 |
| 3 | 4.0 | MD | 2014 |
| 4 | 4.1 | MD | 2015 |
Create a DataFrame specifying a sequence of columns:
df_2 = DataFrame(data_1, columns=['year', 'state', 'pop'])
df_2
| year | state | pop | |
|---|---|---|---|
| 0 | 2012 | VA | 5.0 |
| 1 | 2013 | VA | 5.1 |
| 2 | 2014 | VA | 5.2 |
| 3 | 2014 | MD | 4.0 |
| 4 | 2015 | MD | 4.1 |
Like Series, columns that are not present in the data are NaN:
df_3 = DataFrame(data_1, columns=['year', 'state', 'pop', 'unempl'])
df_3
| year | state | pop | unempl | |
|---|---|---|---|---|
| 0 | 2012 | VA | 5.0 | NaN |
| 1 | 2013 | VA | 5.1 | NaN |
| 2 | 2014 | VA | 5.2 | NaN |
| 3 | 2014 | MD | 4.0 | NaN |
| 4 | 2015 | MD | 4.1 | NaN |
Retrieve a column by key, returning a Series:
df_3['state']
0    VA
1    VA
2    VA
3    MD
4    MD
Name: state, dtype: object
Retrieve a column by attribute, returning a Series:
df_3.year
0    2012
1    2013
2    2014
3    2014
4    2015
Name: year, dtype: int64
Retrieve a row by position:
df_3.ix[0]
year      2012
state       VA
pop          5
unempl     NaN
Name: 0, dtype: object
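Note that `ix` was deprecated in pandas 0.20 and removed in 1.0; in newer versions the label-based `loc` and position-based `iloc` indexers cover the same use cases. A rough equivalent of the call above, on a small stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({'year': [2012, 2013], 'state': ['VA', 'VA']})
row_by_position = df.iloc[0]  # select by integer position
row_by_label = df.loc[0]      # select by index label (0 here, since the
                              # default labels coincide with positions)
```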
Update a column by assignment:
df_3['unempl'] = np.arange(5)
df_3
| year | state | pop | unempl | |
|---|---|---|---|---|
| 0 | 2012 | VA | 5.0 | 0 |
| 1 | 2013 | VA | 5.1 | 1 |
| 2 | 2014 | VA | 5.2 | 2 |
| 3 | 2014 | MD | 4.0 | 3 |
| 4 | 2015 | MD | 4.1 | 4 |
Assign a Series to a column (note if assigning a list or array, the length must match the DataFrame, unlike a Series):
unempl = Series([6.0, 6.0, 6.1], index=[2, 3, 4])
df_3['unempl'] = unempl
df_3
| year | state | pop | unempl | |
|---|---|---|---|---|
| 0 | 2012 | VA | 5.0 | NaN |
| 1 | 2013 | VA | 5.1 | NaN |
| 2 | 2014 | VA | 5.2 | 6.0 |
| 3 | 2014 | MD | 4.0 | 6.0 |
| 4 | 2015 | MD | 4.1 | 6.1 |
Assigning to a column that doesn't exist creates a new column:
df_3['state_dup'] = df_3['state']
df_3
| year | state | pop | unempl | state_dup | |
|---|---|---|---|---|---|
| 0 | 2012 | VA | 5.0 | NaN | VA |
| 1 | 2013 | VA | 5.1 | NaN | VA |
| 2 | 2014 | VA | 5.2 | 6.0 | VA |
| 3 | 2014 | MD | 4.0 | 6.0 | MD |
| 4 | 2015 | MD | 4.1 | 6.1 | MD |
Delete a column:
del df_3['state_dup']
df_3
| year | state | pop | unempl | |
|---|---|---|---|---|
| 0 | 2012 | VA | 5.0 | NaN |
| 1 | 2013 | VA | 5.1 | NaN |
| 2 | 2014 | VA | 5.2 | 6.0 |
| 3 | 2014 | MD | 4.0 | 6.0 |
| 4 | 2015 | MD | 4.1 | 6.1 |
Create a DataFrame from a nested dict of dicts (the keys in the inner dicts are unioned and sorted to form the index in the result, unless an explicit index is specified):
pop = {'VA' : {2013 : 5.1, 2014 : 5.2},
       'MD' : {2014 : 4.0, 2015 : 4.1}}
df_4 = DataFrame(pop)
df_4
| MD | VA | |
|---|---|---|
| 2013 | NaN | 5.1 |
| 2014 | 4.0 | 5.2 |
| 2015 | 4.1 | NaN |
Transpose the DataFrame:
df_4.T
| 2013 | 2014 | 2015 | |
|---|---|---|---|
| MD | NaN | 4.0 | 4.1 |
| VA | 5.1 | 5.2 | NaN |
Create a DataFrame from a dict of Series:
data_2 = {'VA' : df_4['VA'][1:],
          'MD' : df_4['MD'][2:]}
df_5 = DataFrame(data_2)
df_5
| MD | VA | |
|---|---|---|
| 2014 | NaN | 5.2 |
| 2015 | 4.1 | NaN |
Set the DataFrame index name:
df_5.index.name = 'year'
df_5
| MD | VA | |
|---|---|---|
| year | ||
| 2014 | NaN | 5.2 |
| 2015 | 4.1 | NaN |
Set the DataFrame columns name:
df_5.columns.name = 'state'
df_5
| state | MD | VA |
|---|---|---|
| year | ||
| 2014 | NaN | 5.2 |
| 2015 | 4.1 | NaN |
Return the data contained in a DataFrame as a 2D ndarray:
df_5.values
array([[ nan,  5.2],
       [ 4.1,  nan]])
If the columns are different dtypes, the 2D ndarray's dtype will accommodate all of the columns:
df_3.values
array([[2012, 'VA', 5.0, nan],
       [2013, 'VA', 5.1, nan],
       [2014, 'VA', 5.2, 6.0],
       [2014, 'MD', 4.0, 6.0],
       [2015, 'MD', 4.1, 6.1]], dtype=object)
Reindexing¶
Create a new object with the data conformed to a new index. Any missing values are set to NaN.
df_3
| year | state | pop | unempl | |
|---|---|---|---|---|
| 0 | 2012 | VA | 5.0 | NaN |
| 1 | 2013 | VA | 5.1 | NaN |
| 2 | 2014 | VA | 5.2 | 6.0 |
| 3 | 2014 | MD | 4.0 | 6.0 |
| 4 | 2015 | MD | 4.1 | 6.1 |
Reindexing rows returns a new frame with the specified index:
df_3.reindex(list(reversed(range(0, 6))))
| year | state | pop | unempl | |
|---|---|---|---|---|
| 5 | NaN | NaN | NaN | NaN |
| 4 | 2015 | MD | 4.1 | 6.1 |
| 3 | 2014 | MD | 4.0 | 6.0 |
| 2 | 2014 | VA | 5.2 | 6.0 |
| 1 | 2013 | VA | 5.1 | NaN |
| 0 | 2012 | VA | 5.0 | NaN |
Missing values can be set to something other than NaN (fill_value applies only to labels introduced by the reindex; existing NaNs are kept). Note the original cell used the empty range(6, 0), which produced an empty frame; range(7) demonstrates the fill:

df_3.reindex(range(7), fill_value=0)

| year | state | pop | unempl | |
|---|---|---|---|---|
| 0 | 2012 | VA | 5.0 | NaN |
| 1 | 2013 | VA | 5.1 | NaN |
| 2 | 2014 | VA | 5.2 | 6.0 |
| 3 | 2014 | MD | 4.0 | 6.0 |
| 4 | 2015 | MD | 4.1 | 6.1 |
| 5 | 0 | 0 | 0.0 | 0.0 |
| 6 | 0 | 0 | 0.0 | 0.0 |
Interpolate ordered data like a time series:
ser_5 = Series(['foo', 'bar', 'baz'], index=[0, 2, 4])
ser_5.reindex(range(5), method='ffill')
0    foo
1    foo
2    bar
3    bar
4    baz
dtype: object
ser_5.reindex(range(5), method='bfill')
0    foo
1    bar
2    bar
3    baz
4    baz
dtype: object
Reindex columns:
df_3.reindex(columns=['state', 'pop', 'unempl', 'year'])
| state | pop | unempl | year | |
|---|---|---|---|---|
| 0 | VA | 5.0 | NaN | 2012 |
| 1 | VA | 5.1 | NaN | 2013 |
| 2 | VA | 5.2 | 6.0 | 2014 |
| 3 | MD | 4.0 | 6.0 | 2014 |
| 4 | MD | 4.1 | 6.1 | 2015 |
Reindex rows and columns while filling rows:
df_3.reindex(index=list(reversed(range(0, 6))),
             fill_value=0,
             columns=['state', 'pop', 'unempl', 'year'])
| state | pop | unempl | year | |
|---|---|---|---|---|
| 5 | 0 | 0.0 | 0.0 | 0 |
| 4 | MD | 4.1 | 6.1 | 2015 |
| 3 | MD | 4.0 | 6.0 | 2014 |
| 2 | VA | 5.2 | 6.0 | 2014 |
| 1 | VA | 5.1 | NaN | 2013 |
| 0 | VA | 5.0 | NaN | 2012 |
Reindex using ix:
df_6 = df_3.ix[range(0, 7), ['state', 'pop', 'unempl', 'year']]
df_6
| state | pop | unempl | year | |
|---|---|---|---|---|
| 0 | VA | 5.0 | NaN | 2012 |
| 1 | VA | 5.1 | NaN | 2013 |
| 2 | VA | 5.2 | 6.0 | 2014 |
| 3 | MD | 4.0 | 6.0 | 2014 |
| 4 | MD | 4.1 | 6.1 | 2015 |
| 5 | NaN | NaN | NaN | NaN |
| 6 | NaN | NaN | NaN | NaN |
Dropping Entries¶
Drop rows from a Series or DataFrame:
df_7 = df_6.drop([0, 1])
df_7
| state | pop | unempl | year | |
|---|---|---|---|---|
| 2 | VA | 5.2 | 6.0 | 2014 |
| 3 | MD | 4.0 | 6.0 | 2014 |
| 4 | MD | 4.1 | 6.1 | 2015 |
| 5 | NaN | NaN | NaN | NaN |
| 6 | NaN | NaN | NaN | NaN |
Drop columns from a DataFrame:
df_7 = df_7.drop('unempl', axis=1)
df_7
| state | pop | year | |
|---|---|---|---|
| 2 | VA | 5.2 | 2014 |
| 3 | MD | 4.0 | 2014 |
| 4 | MD | 4.1 | 2015 |
| 5 | NaN | NaN | NaN |
| 6 | NaN | NaN | NaN |
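Related: `dropna` removes rows containing missing data; with `how='all'`, only rows that are entirely NaN (like rows 5 and 6 above) are dropped. A sketch with small stand-in data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'state': ['VA', 'MD', np.nan],
                   'pop': [5.0, np.nan, np.nan]})
all_nan_dropped = df.dropna(how='all')  # keeps rows with at least one value
any_nan_dropped = df.dropna()           # keeps only fully populated rows
```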
Indexing, Selecting, Filtering¶
Series indexing is similar to NumPy array indexing with the added bonus of being able to use the Series' index values.
ser_2
a    1
b    1
c    2
d   -3
e   -5
dtype: int64
Select a value from a Series:
ser_2[0] == ser_2['a']
True
Select a slice from a Series:
ser_2[1:4]
b    1
c    2
d   -3
dtype: int64
Select specific values from a Series:
ser_2[['b', 'c', 'd']]
b    1
c    2
d   -3
dtype: int64
Select from a Series based on a filter:
ser_2[ser_2 > 0]
a    1
b    1
c    2
dtype: int64
Select a slice from a Series with labels (note the end point is inclusive):
ser_2['a':'b']
a    1
b    1
dtype: int64
Assign to a Series slice (note the end point is inclusive):
ser_2['a':'b'] = 0
ser_2

a    0
b    0
c    2
d   -3
e   -5
dtype: int64
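The two slicing behaviors can be contrasted side by side; a small sketch using ser_2's original values:

```python
import pandas as pd

ser = pd.Series([1, 1, 2, -3, -5], index=['a', 'b', 'c', 'd', 'e'])
positional = ser[1:3]    # integer slice: end point excluded -> 'b', 'c'
by_label = ser['b':'d']  # label slice: end point included -> 'b', 'c', 'd'
```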
Pandas supports indexing into a DataFrame.
df_6
| state | pop | unempl | year | |
|---|---|---|---|---|
| 0 | VA | 5.0 | NaN | 2012 |
| 1 | VA | 5.1 | NaN | 2013 |
| 2 | VA | 5.2 | 6.0 | 2014 |
| 3 | MD | 4.0 | 6.0 | 2014 |
| 4 | MD | 4.1 | 6.1 | 2015 |
| 5 | NaN | NaN | NaN | NaN |
| 6 | NaN | NaN | NaN | NaN |
Select specified columns from a DataFrame:
df_6[['pop', 'unempl']]
| pop | unempl | |
|---|---|---|
| 0 | 5.0 | NaN |
| 1 | 5.1 | NaN |
| 2 | 5.2 | 6.0 |
| 3 | 4.0 | 6.0 |
| 4 | 4.1 | 6.1 |
| 5 | NaN | NaN |
| 6 | NaN | NaN |
Select a slice from a DataFrame:
df_6[:2]
| state | pop | unempl | year | |
|---|---|---|---|---|
| 0 | VA | 5.0 | NaN | 2012 |
| 1 | VA | 5.1 | NaN | 2013 |
Select from a DataFrame based on a filter:
df_6[df_6['pop'] > 5]
| state | pop | unempl | year | |
|---|---|---|---|---|
| 1 | VA | 5.1 | NaN | 2013 |
| 2 | VA | 5.2 | 6 | 2014 |
Perform a scalar comparison on a DataFrame:
df_6 > 5
| state | pop | unempl | year | |
|---|---|---|---|---|
| 0 | True | False | False | True |
| 1 | True | True | False | True |
| 2 | True | True | True | True |
| 3 | True | False | True | True |
| 4 | True | False | True | True |
| 5 | True | False | False | False |
| 6 | True | False | False | False |
Perform a scalar comparison on a DataFrame, retain the values that pass the filter:
df_6[df_6 > 5]
| state | pop | unempl | year | |
|---|---|---|---|---|
| 0 | VA | NaN | NaN | 2012 |
| 1 | VA | 5.1 | NaN | 2013 |
| 2 | VA | 5.2 | 6.0 | 2014 |
| 3 | MD | NaN | 6.0 | 2014 |
| 4 | MD | NaN | 6.1 | 2015 |
| 5 | NaN | NaN | NaN | NaN |
| 6 | NaN | NaN | NaN | NaN |
Select a slice of rows from a DataFrame (note the end point is inclusive):
df_6.ix[2:3]
| state | pop | unempl | year | |
|---|---|---|---|---|
| 2 | VA | 5.2 | 6 | 2014 |
| 3 | MD | 4.0 | 6 | 2014 |
Select a slice of rows from a specific column of a DataFrame:
df_6.ix[0:2, 'pop']
0    5.0
1    5.1
2    5.2
Name: pop, dtype: float64
Select rows based on a boolean comparison on a specific column:
df_6.ix[df_6.unempl > 5.0]
| state | pop | unempl | year | |
|---|---|---|---|---|
| 2 | VA | 5.2 | 6.0 | 2014 |
| 3 | MD | 4.0 | 6.0 | 2014 |
| 4 | MD | 4.1 | 6.1 | 2015 |
Arithmetic and Data Alignment¶
Adding Series objects with different indexes results in the union of the indexes; labels found in only one of the operands produce NaN:
np.random.seed(0)
ser_6 = Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
ser_6

a    1.764052
b    0.400157
c    0.978738
d    2.240893
e    1.867558
dtype: float64
np.random.seed(1)
ser_7 = Series(np.random.randn(5), index=['a', 'c', 'e', 'f', 'g'])
ser_7

a    1.624345
c   -0.611756
e   -0.528172
f   -1.072969
g    0.865408
dtype: float64
ser_6 + ser_7
a    3.388398
b         NaN
c    0.366982
d         NaN
e    1.339386
f         NaN
g         NaN
dtype: float64
Set a fill value instead of NaN for indices that do not overlap:
ser_6.add(ser_7, fill_value=0)
a    3.388398
b    0.400157
c    0.366982
d    2.240893
e    1.339386
f   -1.072969
g    0.865408
dtype: float64
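`add` is one of a family of flexible arithmetic methods (`sub`, `mul`, `div`) that all accept fill_value; a sketch with small stand-in Series:

```python
import pandas as pd

ser_a = pd.Series([1.0, 2.0], index=['a', 'b'])
ser_b = pd.Series([10.0, 20.0], index=['b', 'c'])
total = ser_a.add(ser_b, fill_value=0)  # missing side treated as 0
diff = ser_a.sub(ser_b, fill_value=0)
```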
Adding DataFrame objects results in the union of both row and column labels; positions found in only one of the operands are NaN:
np.random.seed(0)
df_8 = DataFrame(np.random.rand(9).reshape((3, 3)), columns=['a', 'b', 'c'])
df_8
| a | b | c | |
|---|---|---|---|
| 0 | 0.548814 | 0.715189 | 0.602763 |
| 1 | 0.544883 | 0.423655 | 0.645894 |
| 2 | 0.437587 | 0.891773 | 0.963663 |
np.random.seed(1)
df_9 = DataFrame(np.random.rand(9).reshape((3, 3)), columns=['b', 'c', 'd'])
df_9
| b | c | d | |
|---|---|---|---|
| 0 | 0.417022 | 0.720324 | 0.000114 |
| 1 | 0.302333 | 0.146756 | 0.092339 |
| 2 | 0.186260 | 0.345561 | 0.396767 |
df_8 + df_9
| a | b | c | d | |
|---|---|---|---|---|
| 0 | NaN | 1.132211 | 1.323088 | NaN |
| 1 | NaN | 0.725987 | 0.792650 | NaN |
| 2 | NaN | 1.078033 | 1.309223 | NaN |
Set a fill value instead of NaN for indices that do not overlap:
df_10 = df_8.add(df_9, fill_value=0)
df_10
| a | b | c | d | |
|---|---|---|---|---|
| 0 | 0.548814 | 1.132211 | 1.323088 | 0.000114 |
| 1 | 0.544883 | 0.725987 | 0.792650 | 0.092339 |
| 2 | 0.437587 | 1.078033 | 1.309223 | 0.396767 |
Like NumPy, pandas supports arithmetic operations between DataFrames and Series.
Match the index of the Series on the DataFrame's columns, broadcasting down the rows:
ser_8 = df_10.ix[0]
df_11 = df_10 - ser_8
df_11
| a | b | c | d | |
|---|---|---|---|---|
| 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 1 | -0.003930 | -0.406224 | -0.530438 | 0.092224 |
| 2 | -0.111226 | -0.054178 | -0.013864 | 0.396653 |
Match the index of the Series on the DataFrame's columns, broadcasting down the rows and union the indices that do not match:
ser_9 = Series(range(3), index=['a', 'd', 'e'])
ser_9

a    0
d    1
e    2
dtype: int64
df_11 - ser_9
| a | b | c | d | e | |
|---|---|---|---|---|---|
| 0 | 0.000000 | NaN | NaN | -1.000000 | NaN |
| 1 | -0.003930 | NaN | NaN | -0.907776 | NaN |
| 2 | -0.111226 | NaN | NaN | -0.603347 | NaN |
Broadcast over the columns and match the rows (axis=0) by using an arithmetic method:
df_10
| a | b | c | d | |
|---|---|---|---|---|
| 0 | 0.548814 | 1.132211 | 1.323088 | 0.000114 |
| 1 | 0.544883 | 0.725987 | 0.792650 | 0.092339 |
| 2 | 0.437587 | 1.078033 | 1.309223 | 0.396767 |
ser_10 = Series([100, 200, 300])
ser_10

0    100
1    200
2    300
dtype: int64
df_10.sub(ser_10, axis=0)
| a | b | c | d | |
|---|---|---|---|---|
| 0 | -99.451186 | -98.867789 | -98.676912 | -99.999886 |
| 1 | -199.455117 | -199.274013 | -199.207350 | -199.907661 |
| 2 | -299.562413 | -298.921967 | -298.690777 | -299.603233 |
Function Application and Mapping¶
NumPy ufuncs (element-wise array methods) operate on pandas objects:
df_11 = np.abs(df_11)
df_11
| a | b | c | d | |
|---|---|---|---|---|
| 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 1 | 0.003930 | 0.406224 | 0.530438 | 0.092224 |
| 2 | 0.111226 | 0.054178 | 0.013864 | 0.396653 |
Apply a function on 1D arrays to each column:
func_1 = lambda x: x.max() - x.min()
df_11.apply(func_1)

a    0.111226
b    0.406224
c    0.530438
d    0.396653
dtype: float64
Apply a function on 1D arrays to each row:
df_11.apply(func_1, axis=1)
0    0.000000
1    0.526508
2    0.382789
dtype: float64
Apply a function and return a DataFrame:
func_2 = lambda x: Series([x.min(), x.max()], index=['min', 'max'])
df_11.apply(func_2)
| a | b | c | d | |
|---|---|---|---|---|
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 0.111226 | 0.406224 | 0.530438 | 0.396653 |
Apply an element-wise Python function to a DataFrame:
func_3 = lambda x: '%.2f' % x
df_11.applymap(func_3)
| a | b | c | d | |
|---|---|---|---|---|
| 0 | 0.00 | 0.00 | 0.00 | 0.00 |
| 1 | 0.00 | 0.41 | 0.53 | 0.09 |
| 2 | 0.11 | 0.05 | 0.01 | 0.40 |
Apply an element-wise Python function to a Series:
df_11['a'].map(func_3)
0    0.00
1    0.00
2    0.11
Name: a, dtype: object
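`map` also accepts a dict, which makes it handy for value-to-value lookups (the state names below are illustrative):

```python
import pandas as pd

ser = pd.Series(['VA', 'MD', 'VA'])
full_names = ser.map({'VA': 'Virginia', 'MD': 'Maryland'})
# values with no entry in the dict would map to NaN
```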
Sorting and Ranking¶
ser_4
fo    100
br    200
bz    300
qx    NaN
Name: foobarbazqux, dtype: float64
Sort a Series by its index:
ser_4.sort_index()
br    200
bz    300
fo    100
qx    NaN
Name: foobarbazqux, dtype: float64
Sort a Series by its values:
ser_4.sort_values()
fo    100
br    200
bz    300
qx    NaN
Name: foobarbazqux, dtype: float64
df_12 = DataFrame(np.arange(12).reshape((3, 4)),
                  index=['three', 'one', 'two'],
                  columns=['c', 'a', 'b', 'd'])
df_12
| c | a | b | d | |
|---|---|---|---|---|
| three | 0 | 1 | 2 | 3 |
| one | 4 | 5 | 6 | 7 |
| two | 8 | 9 | 10 | 11 |
Sort a DataFrame by its index:
df_12.sort_index()
| c | a | b | d | |
|---|---|---|---|---|
| one | 4 | 5 | 6 | 7 |
| three | 0 | 1 | 2 | 3 |
| two | 8 | 9 | 10 | 11 |
Sort a DataFrame by columns in descending order:
df_12.sort_index(axis=1, ascending=False)
| d | c | b | a | |
|---|---|---|---|---|
| three | 3 | 0 | 2 | 1 |
| one | 7 | 4 | 6 | 5 |
| two | 11 | 8 | 10 | 9 |
Sort a DataFrame's values by column:
df_12.sort_values(by=['d', 'c'])
| c | a | b | d | |
|---|---|---|---|---|
| three | 0 | 1 | 2 | 3 |
| one | 4 | 5 | 6 | 7 |
| two | 8 | 9 | 10 | 11 |
Ranking is similar to numpy.argsort except that ties are broken by assigning each group the mean rank:
ser_11 = Series([7, -5, 7, 4, 2, 0, 4, 7])
ser_11 = ser_11.sort_values()
ser_11

1   -5
5    0
4    2
3    4
6    4
0    7
2    7
7    7
dtype: int64
ser_11.rank()
1    1.0
5    2.0
4    3.0
3    4.5
6    4.5
0    7.0
2    7.0
7    7.0
dtype: float64
Rank a Series according to the order in which the values appear in the data:
ser_11.rank(method='first')
1    1
5    2
4    3
3    4
6    5
0    6
2    7
7    8
dtype: float64
Rank a Series in descending order, using the maximum rank for the group:
ser_11.rank(ascending=False, method='max')
1    8
5    7
4    6
3    5
6    5
0    3
2    3
7    3
dtype: float64
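Other tie-breaking options include 'min' (lowest rank in the group) and 'dense' (like 'min', but with no gaps after tied groups); a small sketch:

```python
import pandas as pd

ser = pd.Series([0, 2, 2, 3])
by_min = ser.rank(method='min')      # the tied 2s both get rank 2; 3 gets 4
by_dense = ser.rank(method='dense')  # the tied 2s get 2; 3 follows at 3
```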
DataFrames can rank over rows or columns.
df_13 = DataFrame({'foo' : [7, -5, 7, 4, 2, 0, 4, 7],
                   'bar' : [-5, 4, 2, 0, 4, 7, 7, 8],
                   'baz' : [-1, 2, 3, 0, 5, 9, 9, 5]})
df_13
| bar | baz | foo | |
|---|---|---|---|
| 0 | -5 | -1 | 7 |
| 1 | 4 | 2 | -5 |
| 2 | 2 | 3 | 7 |
| 3 | 0 | 0 | 4 |
| 4 | 4 | 5 | 2 |
| 5 | 7 | 9 | 0 |
| 6 | 7 | 9 | 4 |
| 7 | 8 | 5 | 7 |
Rank a DataFrame over rows:
df_13.rank()
| bar | baz | foo | |
|---|---|---|---|
| 0 | 1.0 | 1.0 | 7.0 |
| 1 | 4.5 | 3.0 | 1.0 |
| 2 | 3.0 | 4.0 | 7.0 |
| 3 | 2.0 | 2.0 | 4.5 |
| 4 | 4.5 | 5.5 | 3.0 |
| 5 | 6.5 | 7.5 | 2.0 |
| 6 | 6.5 | 7.5 | 4.5 |
| 7 | 8.0 | 5.5 | 7.0 |
Rank a DataFrame over columns:
df_13.rank(axis=1)
| bar | baz | foo | |
|---|---|---|---|
| 0 | 1.0 | 2.0 | 3 |
| 1 | 3.0 | 2.0 | 1 |
| 2 | 1.0 | 2.0 | 3 |
| 3 | 1.5 | 1.5 | 3 |
| 4 | 2.0 | 3.0 | 1 |
| 5 | 2.0 | 3.0 | 1 |
| 6 | 2.0 | 3.0 | 1 |
| 7 | 3.0 | 1.0 | 2 |
Axis Indexes with Duplicate Values¶
Labels do not have to be unique in Pandas:
ser_12 = Series(range(5), index=['foo', 'foo', 'bar', 'bar', 'baz'])
ser_12

foo    0
foo    1
bar    2
bar    3
baz    4
dtype: int64
ser_12.index.is_unique
False
Select Series elements:
ser_12['foo']
foo    0
foo    1
dtype: int64
Select DataFrame elements:
df_14 = DataFrame(np.random.randn(5, 4),
                  index=['foo', 'foo', 'bar', 'bar', 'baz'])
df_14
| 0 | 1 | 2 | 3 | |
|---|---|---|---|---|
| foo | -2.363469 | 1.135345 | -1.017014 | 0.637362 |
| foo | -0.859907 | 1.772608 | -1.110363 | 0.181214 |
| bar | 0.564345 | -0.566510 | 0.729976 | 0.372994 |
| bar | 0.533811 | -0.091973 | 1.913820 | 0.330797 |
| baz | 1.141943 | -1.129595 | -0.850052 | 0.960820 |
df_14.ix['bar']
| 0 | 1 | 2 | 3 | |
|---|---|---|---|---|
| bar | 0.564345 | -0.566510 | 0.729976 | 0.372994 |
| bar | 0.533811 | -0.091973 | 1.913820 | 0.330797 |
Summarizing and Computing Descriptive Statistics¶
Unlike NumPy arrays, Pandas descriptive statistics automatically exclude missing data. NaN values are excluded unless the entire row or column is NA.
df_6
| state | pop | unempl | year | |
|---|---|---|---|---|
| 0 | VA | 5.0 | NaN | 2012 |
| 1 | VA | 5.1 | NaN | 2013 |
| 2 | VA | 5.2 | 6.0 | 2014 |
| 3 | MD | 4.0 | 6.0 | 2014 |
| 4 | MD | 4.1 | 6.1 | 2015 |
| 5 | NaN | NaN | NaN | NaN |
| 6 | NaN | NaN | NaN | NaN |
df_6.sum()
pop          23.4
unempl       18.1
year      10068.0
dtype: float64
Sum over the rows:
df_6.sum(axis=1)
0    2017.0
1    2018.1
2    2025.2
3    2024.0
4    2025.2
5       0.0
6       0.0
dtype: float64
Account for NaNs:
df_6.sum(axis=1, skipna=False)
0       NaN
1       NaN
2    2025.2
3    2024.0
4    2025.2
5       NaN
6       NaN
dtype: float64
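The same NaN-skipping behavior applies to other reductions such as `mean` and `idxmax`; a sketch with stand-in data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'pop': [5.0, 5.1, np.nan],
                   'unempl': [np.nan, 6.0, 6.1]})
means = df.mean()            # NaNs are excluded column by column
top = df['unempl'].idxmax()  # index label of the max, skipping NaN
```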
Cleaning Data (Under Construction)¶
- Replace
- Drop
- Concatenate
from pandas import Series, DataFrame
import pandas as pd
Set up a DataFrame:
data_1 = {'state' : ['VA', 'VA', 'VA', 'MD', 'MD'],
          'year' : [2012, 2013, 2014, 2014, 2015],
          'population' : [5.0, 5.1, 5.2, 4.0, 4.1]}
df_1 = DataFrame(data_1)
df_1
| population | state | year | |
|---|---|---|---|
| 0 | 5.0 | VA | 2012 |
| 1 | 5.1 | VA | 2013 |
| 2 | 5.2 | VA | 2014 |
| 3 | 4.0 | MD | 2014 |
| 4 | 4.1 | MD | 2015 |
Replace¶
Replace all occurrences of a string with another string, in place (no copy):
df_1.replace('VA', 'VIRGINIA', inplace=True)
df_1
| population | state | year | |
|---|---|---|---|
| 0 | 5.0 | VIRGINIA | 2012 |
| 1 | 5.1 | VIRGINIA | 2013 |
| 2 | 5.2 | VIRGINIA | 2014 |
| 3 | 4.0 | MD | 2014 |
| 4 | 4.1 | MD | 2015 |
In a specified column, replace all occurrences of a string with another string, in place (no copy):
df_1.replace({'state' : {'MD' : 'MARYLAND'}}, inplace=True)
df_1
| population | state | year | |
|---|---|---|---|
| 0 | 5.0 | VIRGINIA | 2012 |
| 1 | 5.1 | VIRGINIA | 2013 |
| 2 | 5.2 | VIRGINIA | 2014 |
| 3 | 4.0 | MARYLAND | 2014 |
| 4 | 4.1 | MARYLAND | 2015 |
Drop¶
Drop the 'population' column and return a copy of the DataFrame:
df_2 = df_1.drop('population', axis=1)
df_2
| state | year | |
|---|---|---|
| 0 | VIRGINIA | 2012 |
| 1 | VIRGINIA | 2013 |
| 2 | VIRGINIA | 2014 |
| 3 | MARYLAND | 2014 |
| 4 | MARYLAND | 2015 |
Concatenate¶
Concatenate two DataFrames:
data_2 = {'state' : ['NY', 'NY', 'NY', 'FL', 'FL'],
          'year' : [2012, 2013, 2014, 2014, 2015],
          'population' : [6.0, 6.1, 6.2, 3.0, 3.1]}
df_3 = DataFrame(data_2)
df_3
| population | state | year | |
|---|---|---|---|
| 0 | 6.0 | NY | 2012 |
| 1 | 6.1 | NY | 2013 |
| 2 | 6.2 | NY | 2014 |
| 3 | 3.0 | FL | 2014 |
| 4 | 3.1 | FL | 2015 |
df_4 = pd.concat([df_1, df_3])
df_4
| population | state | year | |
|---|---|---|---|
| 0 | 5.0 | VIRGINIA | 2012 |
| 1 | 5.1 | VIRGINIA | 2013 |
| 2 | 5.2 | VIRGINIA | 2014 |
| 3 | 4.0 | MARYLAND | 2014 |
| 4 | 4.1 | MARYLAND | 2015 |
| 0 | 6.0 | NY | 2012 |
| 1 | 6.1 | NY | 2013 |
| 2 | 6.2 | NY | 2014 |
| 3 | 3.0 | FL | 2014 |
| 4 | 3.1 | FL | 2015 |
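Note the duplicated 0-4 row labels above; passing ignore_index=True to `concat` discards the original indices and assigns a fresh 0..n-1 range. A sketch with small stand-in frames:

```python
import pandas as pd

df_a = pd.DataFrame({'state': ['VA', 'MD'], 'year': [2012, 2013]})
df_b = pd.DataFrame({'state': ['NY', 'FL'], 'year': [2014, 2015]})
combined = pd.concat([df_a, df_b], ignore_index=True)
```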
Input and Output (Under Construction)¶
- Reading
- Writing
from pandas import Series, DataFrame
import pandas as pd
Reading¶
Read data from a CSV file into a DataFrame (use sep='\t' for TSV):
df_1 = pd.read_csv("../data/ozone.csv")
Get a summary of the DataFrame:
df_1.describe()
| Ozone | Solar.R | Wind | Temp | Month | Day | |
|---|---|---|---|---|---|---|
| count | 116.000000 | 146.000000 | 153.000000 | 153.000000 | 153.000000 | 153.000000 |
| mean | 42.129310 | 185.931507 | 9.957516 | 77.882353 | 6.993464 | 15.803922 |
| std | 32.987885 | 90.058422 | 3.523001 | 9.465270 | 1.416522 | 8.864520 |
| min | 1.000000 | 7.000000 | 1.700000 | 56.000000 | 5.000000 | 1.000000 |
| 25% | 18.000000 | 115.750000 | 7.400000 | 72.000000 | 6.000000 | 8.000000 |
| 50% | 31.500000 | 205.000000 | 9.700000 | 79.000000 | 7.000000 | 16.000000 |
| 75% | 63.250000 | 258.750000 | 11.500000 | 85.000000 | 8.000000 | 23.000000 |
| max | 168.000000 | 334.000000 | 20.700000 | 97.000000 | 9.000000 | 31.000000 |
List the first five rows of the DataFrame:
df_1.head()
| Ozone | Solar.R | Wind | Temp | Month | Day | |
|---|---|---|---|---|---|---|
| 0 | 41 | 190 | 7.4 | 67 | 5 | 1 |
| 1 | 36 | 118 | 8.0 | 72 | 5 | 2 |
| 2 | 12 | 149 | 12.6 | 74 | 5 | 3 |
| 3 | 18 | 313 | 11.5 | 62 | 5 | 4 |
| 4 | NaN | NaN | 14.3 | 56 | 5 | 5 |
Writing¶
Create a copy of the CSV file, encoded in UTF-8 and hiding the index and header labels:
df_1.to_csv('../data/ozone_copy.csv', encoding='utf-8', index=False, header=False)
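A file written without a header can be read back by supplying column names explicitly; a self-contained sketch using an in-memory buffer in place of the data directory:

```python
import io
import pandas as pd

csv_text = "41,190\n36,118\n"  # two headerless rows
df = pd.read_csv(io.StringIO(csv_text), header=None,
                 names=['Ozone', 'Solar.R'])
```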
View the data directory:
!ls -l ../data/
total 1016
-rw-r--r--   1 donnemartin  staff  437903 Jul  7  2015 churn.csv
-rwxr-xr-x   1 donnemartin  staff   72050 Jul  7  2015 confusion_matrix.png
-rw-r--r--   1 donnemartin  staff    2902 Jul  7  2015 ozone.csv
-rw-r--r--   1 donnemartin  staff    3324 Apr  1 07:18 ozone_copy.csv
drwxr-xr-x  10 donnemartin  staff     340 Jul  7  2015 titanic