pandas¶

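pandas features and import¶

Provides high-level data structures and sophisticated tools for working with data
pandas is built on top of NumPy
Import:

import numpy as np
from pandas import Series, DataFrame
import pandas as pd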

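pandas data structures¶

SERIES

A one-dimensional array-like object

It holds an array of data (of any NumPy data type) together with an index associated with that array

Without an explicit index: a = Series([1,2,3]), which prints as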

0 1
1 2
2 3

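The attributes a.index and a.values hold the index and the values, respectively

With an explicit index: a = Series([1,2,3],index=['a','b','c'])

Values can be accessed by label: a['b']

Check whether a label exists in the index: 'b' in a

Creating a Series from a dict

data = {'china':10,'america':30,'indian':20}
print(Series(data))

Output (older pandas versions sort the keys as shown here; recent versions keep the dict's insertion order):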

america 30
china 10
indian 20
dtype: int64

Finding which labels have missing values:

data = {'china':10,'america':30,'indian':20}
state = ['china','america','test']
a = Series(data,state)
print(a.isnull())

Output (the label test has no corresponding value):

china False
america False
test True
dtype: bool

In arithmetic operations, Series automatically align their data by index label

a = Series([10,20],['china','test'])
b = Series([10,20],['test','china'])
print(a+b)

Output:

china 30
test 30
dtype: int64

Setting the name attribute of a Series and of its index

a = Series([10,20],['china','test'])
a.index.name = 'state'
a.name = 'number'
print(a)

Output:

state
china 10
test 20
Name: number, dtype: int64

DATAFRAME

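A DataFrame represents a table, a spreadsheet-like data structure

It contains an ordered collection of columns

Each column can hold a different value type (numeric, string, boolean, and so on)

Internally a DataFrame stores its data in two-dimensional form, and hierarchical indexing lets you represent higher-dimensional data in this tabular format

Creating one:

From a dict

data = {'state': ['a', 'b', 'c', 'd', 'd'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
print(frame)

Output (in older pandas versions the columns are sorted by name, as shown here, unless you pass the column order explicitly; recent versions keep the dict's insertion order. The index is assigned automatically):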

 pop state year
0 1.5 a 2000
1 1.7 b 2001
2 3.6 c 2002
3 2.4 d 2001
4 2.9 d 2002

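Access

Columns: as with a Series, access a column by name: frame['state'] or frame.state

Rows: older pandas versions use the ix indexer, e.g. frame.ix[2], which returns the row labelled 2 (here the third row) with one value per column. ix was removed in pandas 1.0; use frame.loc[2] for label-based access or frame.iloc[2] for position-based access

Assignment: frame2['debt'] = np.arange(5.) adds a new column if there is no column named debt yet

Deleting a column: del frame2['eastern']

As with a Series, the values attribute returns the data in the DataFrame as a two-dimensional ndarray

List all columns: frame.columns

Transpose: frame2.T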
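The access patterns above can be tried end to end with a short sketch. The data below is hypothetical (modelled on the frame shown earlier), and loc/iloc stand in for the older ix indexer.

```python
import numpy as np
import pandas as pd

# Hypothetical data, modelled on the frame built from a dict above
frame2 = pd.DataFrame({'state': ['a', 'b', 'c', 'd', 'd'],
                       'year': [2000, 2001, 2002, 2001, 2002],
                       'pop': [1.5, 1.7, 3.6, 2.4, 2.9]})

states = frame2['state']          # column access by name, returns a Series
row_by_label = frame2.loc[2]      # row whose index label is 2
row_by_position = frame2.iloc[2]  # third row, counted by position

frame2['debt'] = np.arange(5.)    # no 'debt' column yet, so one is added
had_debt = 'debt' in frame2.columns
del frame2['debt']                # remove the column again

values = frame2.values            # two-dimensional ndarray
transposed = frame2.T             # rows and columns swapped

print(row_by_label)
print(list(frame2.columns))
print(transposed.shape)
```

With the default integer index, label 2 and position 2 refer to the same row; they differ as soon as the index is reordered or uses other labels.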

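Index objects

pandas Index objects hold the axis labels and other metadata (such as the axis name or names)

Index objects are immutable, so their elements cannot be changed in place by the user

Create one with index = pd.Index([1,2,3])

Common operations

append --> concatenate additional Index objects, producing a new Index

diff --> compute the set difference (named difference in current pandas versions)

intersection --> compute the set intersection

union --> compute the set union

isin --> compute a boolean array indicating whether each value is contained in the passed collection

delete --> compute a new Index with the element at position i deleted

drop --> compute a new Index with the passed values deleted

insert --> compute a new Index with an element inserted at position i

is_monotonic --> returns True if each element is greater than or equal to the previous element (named is_monotonic_increasing in current pandas versions)

is_unique --> returns True if the Index has no duplicate values

unique --> compute the array of unique values in the Index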
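As a rough illustration of the operations listed above, the sketch below runs each one on two small, made-up Index objects. It uses the current method names difference and is_monotonic_increasing.

```python
import pandas as pd

index = pd.Index([1, 2, 3])
other = pd.Index([3, 4])

appended = index.append(other)             # labels 1, 2, 3, 3, 4
difference = index.difference(other)       # labels only in index
intersection = index.intersection(other)   # labels in both
union = index.union(other)                 # labels in either
mask = index.isin([2, 3])                  # boolean ndarray, one flag per label
deleted = index.delete(0)                  # new Index without position 0
dropped = index.drop([2])                  # new Index without the label 2
inserted = index.insert(1, 10)             # new Index with 10 at position 1

print(index.is_monotonic_increasing)
print(appended.is_unique)
print(appended.unique())

# Every call above returned a new object; assigning into an Index fails.
immutable = False
try:
    index[0] = 99
except TypeError:
    immutable = True
print(immutable)
```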

Reindexing with reindex¶

SERIES

Reordering

a = Series([2,3,1],index=['b','a','c'])
b = a.reindex(['a','b','c'])
print(b)

Reordering, with labels missing from the original filled with 0: b=a.reindex(['a','b','c','d'],fill_value=0)

Values can also be interpolated or filled while reindexing

a = Series(['a','b','c'],index=[0,2,4])
b = a.reindex(range(6),method='ffill')
print(b)

Output:

0 a
1 a
2 b
3 b
4 c
5 c
dtype: object

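Values for the method argument

ffill or pad --> fill forward (carry the previous value forward)

bfill or backfill --> fill backward (carry the next value backward)

DATAFRAME

As with a Series, reindex works on the index; it can also reindex the columns: frame.reindex(columns=['a','b'])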
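A minimal sketch of reindexing a DataFrame along both axes follows. The frame and its labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 frame
frame = pd.DataFrame(np.arange(6).reshape(3, 2),
                     index=['a', 'c', 'd'], columns=['x', 'y'])

# Reindex the rows: the new label 'b' has no data, so its row is NaN
rows = frame.reindex(['a', 'b', 'c', 'd'])

# Reindex the columns: reorder them and fill the new column 'z' with 0
cols = frame.reindex(columns=['y', 'x', 'z'], fill_value=0)

print(rows)
print(cols)
```

reindex returns a new object in both cases and leaves frame itself unchanged.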

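Dropping entries from an axis¶

SERIES

a.drop(['a','b']) drops the entries labelled a and b

DATAFRAME

Dropping index entries works the same way as for a Series

Dropping a column --> a.drop(['one'], axis=1) drops the column named one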
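The sketch below applies drop to a Series and to a DataFrame. The row labels r1 to r3 are hypothetical, chosen only to make the result easy to read.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
s_dropped = s.drop(['a', 'b'])              # new Series without a and b

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['r1', 'r2', 'r3'],
                     columns=['one', 'two', 'three'])
rows_dropped = frame.drop(['r1'])           # drop a row (axis=0 is the default)
cols_dropped = frame.drop(['one'], axis=1)  # drop a column

print(s_dropped)
print(rows_dropped)
print(cols_dropped)
```

drop returns a new object each time, so s and frame keep all of their entries.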

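Indexing, selection, and filtering¶

SERIES

Data can be accessed by index label or by integer position, e.g. for a = Series(np.arange(4.), index=['a', 'b', 'c', 'd']), a['b'] and a[1] return the same value (recent pandas versions deprecate positional access through [] on a labelled index; a.iloc[1] is the explicit form)

Slicing with labels differs from normal Python slicing: the endpoint is included

a = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
print(a['b':'c'])

The output includes the value for the label c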

DATAFRAME

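First two rows: a[:2]
Boolean selection: a[a['two']>5]
Using the ix indexer:
- the row with index 2 and the columns 'one' and 'two' --> a.ix[[2],['one','two']]
- the whole row with index 2: a.ix[2]
ix was removed in pandas 1.0; a.loc accepts the same arguments for label-based selection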
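These selections can be checked with a small sketch. It assumes a 4x4 frame with columns one to four, and uses loc for the label-based lookups.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(16).reshape(4, 4),
                 index=[0, 1, 2, 3],
                 columns=['one', 'two', 'three', 'four'])

first_two = a[:2]                     # first two rows (positional slice)
filtered = a[a['two'] > 5]            # rows where column 'two' exceeds 5
subset = a.loc[[2], ['one', 'two']]   # row 2, columns 'one' and 'two'
row = a.loc[2]                        # the whole row labelled 2, as a Series

print(first_two)
print(filtered)
print(subset)
print(row)
```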

Operations between a DataFrame and a Series¶

Subtracting a Series from every row of a DataFrame

a = pd.DataFrame(np.arange(16).reshape(4,4),index=[0,1,2,3],columns=['one', 'two','three','four'])
print(a)
b = Series([0,1,2,3],index=['one','two','three','four'])
print(b)
print(a-b)

Output:

 one two three four
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
one 0
two 1
three 2
four 3
dtype: int64
 one two three four
0 0 0 0 0
1 4 4 4 4
2 8 8 8 8
3 12 12 12 12

Reading files¶

CSV files: pd.read_csv(r"data/train.csv") returns a DataFrame
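To try read_csv without a file on disk, a sketch like the following parses CSV text from memory. The column names and values are made up, and io.StringIO stands in for the file path.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for data/train.csv
csv_text = "PassengerId,Survived,Age\n1,0,22\n2,1,38\n3,1,\n"

train_data = pd.read_csv(io.StringIO(csv_text))

print(type(train_data))
print(train_data.dtypes)
print(train_data['Age'].isnull().sum())   # number of missing ages
```

The empty Age field in the last record is read as NaN, which is why that column is parsed as floating point.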

Inspecting a DataFrame¶

train_data.describe()

 PassengerId Survived Pclass Age SibSp \
count 891.000000 891.000000 891.000000 714.000000 891.000000 
mean 446.000000 0.383838 2.308642 29.699118 0.523008 
std 257.353842 0.486592 0.836071 14.526497 1.102743 
min 1.000000 0.000000 1.000000 0.420000 0.000000 
25% 223.500000 0.000000 2.000000 20.125000 0.000000 
50% 446.000000 0.000000 3.000000 28.000000 0.000000 
75% 668.500000 1.000000 3.000000 38.000000 1.000000 
max 891.000000 1.000000 3.000000 80.000000 8.000000

Locating values in a column and replacing them¶

df.loc[df.Age.isnull(),'Age'] = 23 # fill the missing values in the 'Age' column with 23
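A short sketch of this replacement on a made-up frame, with fillna shown as an alternative way to get the same result:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing age
df = pd.DataFrame({'Name': ['x', 'y', 'z'],
                   'Age': [22.0, np.nan, 38.0]})

# Alternative that leaves df untouched: fillna returns a new Series
filled = df['Age'].fillna(23)

# Boolean mask selects the rows, 'Age' selects the column, then assign in place
df.loc[df.Age.isnull(), 'Age'] = 23

print(df)
print(filled.tolist())
```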

Converting a categorical variable to indicator variables with `get_dummies()`¶

s = pd.Series(list('abca'))
pd.get_dummies(s)
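To see what get_dummies produces, the sketch below prints the result for the same Series. The indicator cells are 0/1 or False/True depending on the pandas version, so they are cast to int before being listed.

```python
import pandas as pd

s = pd.Series(list('abca'))
dummies = pd.get_dummies(s)    # one indicator column per distinct value

print(dummies)
print(list(dummies.columns))
print(dummies.astype(int).values.tolist())
```

Each row has exactly one indicator set, in the column that matches the original value.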

Converting between list and string¶

string to list

>>> s = 'abcde'
>>> chars = list(s)
>>> chars
['a', 'b', 'c', 'd', 'e']

list to string

>>> str_convert = ','.join(chars)
>>> str_convert
'a,b,c,d,e'

Dropping the old index and renumbering from 0¶

x = x.reset_index(drop=True)
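A typical case is an index left with gaps after filtering. The sketch below uses made-up values and also shows what happens when drop=True is left out.

```python
import pandas as pd

x = pd.Series([10, 20, 30, 40])
filtered = x[x > 15]                  # keeps the labels 1, 2, 3

kept = filtered.reset_index()         # old index is kept as a column named 'index'
x = filtered.reset_index(drop=True)   # old index is discarded, labels restart at 0

print(list(filtered.index))
print(list(x.index))
print(list(kept.columns))
```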

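The apply function¶

DataFrame.apply(func, axis=0, broadcast=False, raw=False, reduce=None, .....

(This is the signature from an older pandas release; broadcast and reduce were removed in pandas 1.0 in favor of result_type.)

df.apply(np.sqrt) # returns DataFrame

Equivalent to df.apply(lambda x : np.sqrt(x)); the lambda form is more flexible

df.apply(np.sum, axis=0) # equiv to df.sum(0)

df.apply(np.sum, axis=1) # equiv to df.sum(1)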
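The calls above can be run on a small made-up frame. The last line adds a lambda that has no single built-in equivalent, to illustrate the flexibility mentioned.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame
df = pd.DataFrame({'a': [1.0, 4.0, 9.0],
                   'b': [16.0, 25.0, 36.0]})

roots = df.apply(np.sqrt)              # element-wise, returns a DataFrame
col_sums = df.apply(np.sum, axis=0)    # one value per column
row_sums = df.apply(np.sum, axis=1)    # one value per row

# Custom reduction: spread (max minus min) of each column
spreads = df.apply(lambda col: col.max() - col.min())

print(roots)
print(col_sums)
print(row_sums)
print(spreads)
```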

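The `re.search().group()` function¶

re.search(pattern, string, flags=0)

group(num=0) returns the matched text. The default, num=0, is the whole match; several group numbers can be passed at once, e.g. group(0,1), which returns a tuple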
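A minimal sketch of group in use follows. The input string and the pattern are hypothetical, chosen so that the match has two capture groups.

```python
import re

# Hypothetical input: a surname, a comma, then a title ending in a period
match = re.search(r'(\w+), (\w+)\.', 'Braund, Mr. Owen Harris')

whole = match.group()       # same as group(0): the entire match
surname = match.group(1)    # first parenthesised group
title = match.group(2)      # second parenthesised group
pair = match.group(0, 1)    # several group numbers give a tuple

print(whole, surname, title, pair)

# re.search returns None when nothing matches, so check before calling group
no_match = re.search(r'\d+', 'abc')
print(no_match is None)
```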

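The pandas.cut() function¶

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)

x is a one-dimensional array
bins can be an int or a sequence
- if an int, the range of x is divided into that many equal-width bins
- if a sequence, its values are the bin edges you specify
right controls whether each bin includes its right edge; the default is True
labels is an array or the boolean False
- if an array, its length must match the number of resulting bins
- if False, only the integer bin codes are returned

e.g. pd.cut(full["FamilySize"], bins=[0,1,4,20], labels=[0,1,2])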
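The sketch below runs the three variants described above on a few made-up family sizes; family_size is a hypothetical stand-in for full["FamilySize"].

```python
import pandas as pd

family_size = pd.Series([1, 2, 4, 5, 11])   # hypothetical values

# Explicit edges give the bins (0, 1], (1, 4] and (4, 20], one label per bin
explicit = pd.cut(family_size, bins=[0, 1, 4, 20], labels=[0, 1, 2])

# labels=False returns the integer bin codes instead of a categorical
codes = pd.cut(family_size, bins=[0, 1, 4, 20], labels=False)

# An int asks for that many equal-width bins over the range of the data
equal_width = pd.cut(family_size, bins=2)

print(explicit.tolist())
print(codes.tolist())
print(equal_width.cat.categories)
```

Because right=True by default, the value 4 falls in the bin (1, 4] rather than (4, 20].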

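Adding a row of data¶

Define an empty DataFrame: data_process = pd.DataFrame(columns=['route','date','1','2','3','4','5','6','7','8','9','10','11','12'])

Define a new row: new = pd.DataFrame(columns=['route','date','1','2','3','4','5','6','7','8','9','10','11','12'],index=[j])

The index here can be set to anything; specify one only if you need a particular value

Append: data_process = data_process.append(new, ignore_index=True). Note the assignment data_process = data_process......., because append returns a new DataFrame rather than modifying data_process in place. DataFrame.append was removed in pandas 2.0; there the equivalent is data_process = pd.concat([data_process, new], ignore_index=True)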
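For pandas versions without DataFrame.append, a sketch along these lines should work. The column list is shortened and the row values are made up for illustration.

```python
import pandas as pd

columns = ['route', 'date', '1', '2']   # shortened, hypothetical column list

data_process = pd.DataFrame([['r1', '2017-01-01', 3, 5]], columns=columns)
new = pd.DataFrame([['r2', '2017-01-02', 7, 1]], columns=columns, index=[42])

# concat returns a new frame, so the result has to be assigned back;
# ignore_index=True discards the label 42 and renumbers from 0
data_process = pd.concat([data_process, new], ignore_index=True)
print(data_process)

# When adding many rows, collecting them first and building once avoids
# copying the whole frame on every addition
rows = [{'route': 'r1', 'date': '2017-01-01', '1': 3, '2': 5},
        {'route': 'r2', 'date': '2017-01-02', '1': 7, '2': 1}]
built = pd.DataFrame(rows, columns=columns)
print(built)
```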