I find myself often having to check whether a column or row exists in a dataframe before trying to reference it. For example I end up adding a lot of code like:
if 'mycol' in df.columns and 'myindex' in df.index:
    x = df.loc['myindex', 'mycol']
else:
    x = mydefault
Is there any way to do this more nicely? For example, on an arbitrary object I can do x = getattr(anobject, 'id', default). Is there anything similar to this in pandas? Really, any way to achieve what I'm doing more gracefully?
There is a get method for Series:
So you could do:
df.mycol.get(myindex, np.nan)
Example:
In [117]:
import numpy as np
import pandas as pd
df = pd.DataFrame({'mycol': np.arange(5), 'dummy': np.arange(5)})
df
Out[117]:
dummy mycol
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
[5 rows x 2 columns]
In [118]:
print(df.mycol.get(2, np.nan))
print(df.mycol.get(5, np.nan))
2
nan
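For anyone who wants to run this outside an interactive session, here is a minimal self-contained sketch of the same idea. The column name nocol is made up for illustration. It also shows the limitation of this approach: the default only covers a missing row label, because selecting a missing column raises before get is ever called.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"mycol": np.arange(5), "dummy": np.arange(5)})

# Existing row label: the stored value is returned.
present = df["mycol"].get(2, np.nan)

# Missing row label: the default comes back instead of a KeyError.
missing_row = df["mycol"].get(5, np.nan)

# A missing column is not covered: the column selection itself raises.
try:
    df["nocol"].get(2, np.nan)
    column_lookup_failed = False
except KeyError:
    column_lookup_failed = True

print(present, missing_row, column_lookup_failed)
```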
1 Comment
df.loc['myindex'].get('mycol', NaN)
A shame that you still need to be sure that one of the index or column exists, but nonetheless this will be useful in a lot of scenarios. Thank you!
Python has this mentality to ask for forgiveness instead of permission. You'll find a lot of posts on this matter, such as this one.
In Python, catching exceptions is relatively inexpensive, so you're encouraged to do it. This is called the EAFP ("easier to ask forgiveness than permission") approach.
For example:
try:
    x = df.loc['myindex', 'mycol']
except KeyError:
    x = mydefault
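Which style is cheaper likely depends on how often the lookup misses, so it may be worth measuring on your own data. The sketch below puts the check-first version and the try/except version side by side and times a hit and a miss for each. The helper names lookup_lbyl and lookup_eafp and the sample frame are made up for illustration, and the timings will vary by machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({"mycol": [10, 20, 30]}, index=["a", "b", "c"])


def lookup_lbyl(frame, row, col, default=None):
    # Look before you leap: test both labels, then index.
    if col in frame.columns and row in frame.index:
        return frame.loc[row, col]
    return default


def lookup_eafp(frame, row, col, default=None):
    # Ask forgiveness: index directly, fall back when a label is missing.
    try:
        return frame.loc[row, col]
    except KeyError:
        return default


# Time a hit ("b") and a miss ("z") for each style.
for name, func in [("LBYL", lookup_lbyl), ("EAFP", lookup_eafp)]:
    for row in ["b", "z"]:
        seconds = timeit.timeit(lambda: func(df, row, "mycol", -1), number=200)
        print(f"{name} row={row!r}: {seconds:.4f}s for 200 lookups")
```

Both helpers return the same values; only the cost of the hit and miss paths differs.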
2 Comments
It is the try: that is inexpensive; the except: seems to be expensive. The moral of the story seems to be that the caller is left to decide between testing for existence and try:/except:-ing. The performance trade-off depends on your use case, i.e. how long it takes to test for existence vs. how many times not testing will raise. Nevertheless, it would be nice if pandas offered syntactic sugar by permitting that choice to be argument-driven. As far as I can tell, it does not.
There is the get method for DataFrame to get a column and another get for Series to get an item. So you can chain them together to get a single value:
A B
0 0 2
1 1 3
df.get('B', default=pd.Series()).get(1, default='[unknown]')
Output:
3
If the index or column is missing:
df.get('B', default=pd.Series()).get(2, default='[unknown]')
# or
df.get('C', default=pd.Series()).get(1, default='[unknown]')
Output:
'[unknown]'
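The chained call can be wrapped in a small helper. This is only a sketch: the name get_cell_chained is made up, and the empty Series is given an explicit dtype because some pandas versions warn when an empty Series is created without one.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1], "B": [2, 3]})


def get_cell_chained(frame, row, col, default=None):
    # If the column is missing, fall back to an empty Series so that the
    # second get() still has something to look the row up in.
    empty = pd.Series(dtype=object)
    return frame.get(col, default=empty).get(row, default=default)


print(get_cell_chained(df, 1, "B", "[unknown]"))  # row and column exist
print(get_cell_chained(df, 2, "B", "[unknown]"))  # row missing
print(get_cell_chained(df, 1, "C", "[unknown]"))  # column missing
```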
Use reindex:
df.reindex(index=['row1', 'row2'], columns=['col1', 'col2'], fill_value=mydefault)
What's great here is using lists for the rows and columns, where some of them exist and some of them don't, and you get the fallback value whenever either the row or column is missing.
Example:
In[1]:
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [5, 3, 7],
})
df
Out[1]:
A B
0 1 5
1 2 3
2 3 7
In[2]:
df.reindex(index=[0, 1, 100], columns=['A', 'C'], fill_value='FV')
Out[2]:
A C
0 1 FV
1 2 FV
100 FV FV
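The same trick also works for a single cell: reindex to a 1x1 frame and take its only value. A minimal sketch follows; the helper name get_cell_reindex is made up, and it assumes the index and column labels are unique, since reindex raises on duplicate labels.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [5, 3, 7]})


def get_cell_reindex(frame, row, col, default):
    # Reindex down to one row and one column; a label that does not exist
    # produces a cell filled with the default value.
    one_cell = frame.reindex(index=[row], columns=[col], fill_value=default)
    return one_cell.iloc[0, 0]


print(get_cell_reindex(df, 1, "A", -1))    # row and column exist
print(get_cell_reindex(df, 100, "A", -1))  # row missing
print(get_cell_reindex(df, 1, "C", -1))    # column missing
```

This builds a temporary frame for every lookup, so it is probably better suited to occasional lookups or to fetching several labels at once than to a tight loop.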
Define Function
# Define Function:
def getvalue(df, index, column_key, default_value):
    try:
        return df.loc[index, column_key]
    except KeyError:
        return default_value
Example:
import pandas as pd

# define dictionary
thisdict = {
    "brand": ["Ford", 'Honda', 'Toyta'],
    "model": ["Mustang", 'CRV', 'Camry'],
    "year": [1964, 2004, 1892]
}
# create dataframe
df = pd.DataFrame(thisdict)
# print dataframe
print(df )
print()
# Test all 4 scenarios
colNotFound = getvalue(df,1,'name',"ColNotFound")
print(colNotFound + '\n')
indexNotFound = getvalue(df, 4,'model',"indexNotFound")
print(indexNotFound + '\n')
colandindexNotFound = getvalue(df, 4,'name',"colandindexNotFound")
print(colandindexNotFound + '\n')
keyandcolindf = getvalue(df, 1,'model',"Nothing")
print(keyandcolindf + '\n')
output:
brand model year
0 Ford Mustang 1964
1 Honda CRV 2004
2 Toyta Camry 1892
ColNotFound
indexNotFound
colandindexNotFound
CRV