I'm reading and processing a fairly large CSV file using Pandas and Python 3.7. The header names in the CSV contain periods ('full stops', as Britons say). That's a problem when you want to address data cells by column name.
test.csv:
"name","birth.place","not.important"
"John","",""
"Paul","Liverpool","blue"
# -*- coding: utf-8 -*-
import pandas as pd
infile = 'test.csv'
useful_cols = ['name', 'birth.place']
df = pd.read_csv(infile, usecols=useful_cols, encoding='utf-8-sig', engine='python')
# replace '.' by '_'
df.columns = df.columns.str.replace('.', '_', regex=False)
# we may want to iterate over useful_cols later, so to keep things consistent:
useful_cols = [s.replace('.', '_') for s in useful_cols]
# now we can do this..
print(df['birth_place'])
# ... and this
for row in df.itertuples():
    print(row.birth_place)
# ain't that nice?
It works, but since Pandas is such a powerful library and the use case is quite common, I'm wondering if there isn't an even better way of doing this.
1 Answer
Did a little digging and found that `itertuples()` exposes a column as `row._<position>` when pandas runs into an issue with its name (in this example, the "."). I am sure you already know that you could just do `df['birth.place']`, since the name sits inside a string there; however, it becomes tricky for `row.birth.place`, as you mentioned. For that you can do the following:
for row in df.itertuples():
    print(row._2)
The `_2` corresponds to the position of the column whose name pandas could not use as an attribute. Pandas renamed it to an underscore followed by its position in the row tuple (the index sits at position 0). Note that this renaming only occurs when pandas ran into an issue using the actual column name (i.e. `row.name` is still `row.name`, and you cannot use `row._1` in place of it). Hope that helps! Happy pythoning!
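To make the renaming visible, here is a minimal sketch that inspects the field names of the tuples returned by `itertuples()`. It assumes the same sample data as `test.csv`, held in memory via `io.StringIO` so the snippet runs on its own; the variable names are only for illustration.

```python
import io
import pandas as pd

# Same sample data as test.csv in the question, kept in memory for this sketch.
csv_text = (
    '"name","birth.place","not.important"\n'
    '"John","",""\n'
    '"Paul","Liverpool","blue"\n'
)
df = pd.read_csv(io.StringIO(csv_text), usecols=['name', 'birth.place'])

# With the default index=True, position 0 is the index, so 'birth.place'
# (not a valid attribute name) becomes '_2'.
rows = list(df.itertuples())
fields = rows[0]._fields
print(fields)  # ('Index', 'name', '_2')

# Without the index, every position shifts down by one.
fields_no_index = next(df.itertuples(index=False))._fields
print(fields_no_index)  # ('name', '_1')
```

So the number after the underscore depends on whether the index is included, which is worth keeping in mind before hard-coding `row._2`.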
Thanks. I didn't mention what I had already found out: `df['birth.place']` only works on entire columns, not on cells. `getattr(row, 'birth.place')` doesn't work because the column is renamed, and `row.birth.place` raises an error: has no attribute 'birth'. – RolfBly, Jul 19, 2018 at 8:15
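For cell-level access that keeps the original dotted label, one possible route is label-based indexing rather than attribute access. The following is only a sketch under the same assumptions as the question's `test.csv` (loaded here from an in-memory string); `paul_place` and `pairs` are illustrative names.

```python
import io
import pandas as pd

# Same sample data as test.csv in the question, kept in memory for this sketch.
csv_text = (
    '"name","birth.place","not.important"\n'
    '"John","",""\n'
    '"Paul","Liverpool","blue"\n'
)
df = pd.read_csv(io.StringIO(csv_text), usecols=['name', 'birth.place'])

# .at reads one cell by (row label, column label); the dot is harmless
# because the column name is passed as a string.
paul_place = df.at[1, 'birth.place']
print(paul_place)  # Liverpool

# Iterating over the columns directly also sidesteps attribute names.
pairs = list(zip(df['name'], df['birth.place']))
for name, place in pairs:
    print(name, place)
```

Whether this is preferable to renaming the columns depends on how often the names are needed as attributes elsewhere.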
Right, `getattr()` would work the same then. You would say `getattr(row, "_2")`, but this is equivalent to saying `row._2`. – PydPiper, Jul 19, 2018 at 13:49
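If hard-coding the positional name feels brittle, a small lookup from original column label to tuple field can be built once and then used with `getattr()`. This is a sketch, not an established pandas idiom; `field_of` is a hypothetical helper name, and the data is again the question's sample held in memory.

```python
import io
import pandas as pd

# Same sample data as test.csv in the question, kept in memory for this sketch.
csv_text = (
    '"name","birth.place","not.important"\n'
    '"John","",""\n'
    '"Paul","Liverpool","blue"\n'
)
df = pd.read_csv(io.StringIO(csv_text), usecols=['name', 'birth.place'])

rows = list(df.itertuples(index=False))

# Hypothetical helper: map each original column label to the field name
# that itertuples() actually assigned to it.
field_of = dict(zip(df.columns, rows[0]._fields))
print(field_of)  # {'name': 'name', 'birth.place': '_1'}

# Look cells up by the original label, whatever the column position is.
places = [getattr(row, field_of['birth.place']) for row in rows]
```

The mapping relies on `index=False`, so the tuple fields line up one-to-one with `df.columns`.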
[...] `csv` library, because it has all these powerful features that I'm keen to explore. In a world without Pandas, I'd have certainly gone for `csv`.