I am cleaning str columns in a Pandas dataframe (see below for an example), and I have been wondering if there are more concise ways or additional inplace methods to do so. What are the general best practices for cleaning columns in Pandas?
import pandas as pd
df = pd.DataFrame.from_dict({"col1": [0, 1, 2, 3], "col2": ["abcd efg", ".%ues", "t12 ^&3", "yupe"]})
df["col2"] = df["col2"].str.lower()
df["col2"] = df["col2"].str.strip()
df["col2"].replace(to_replace="[^a-zA-Z ]", value="", regex=True, inplace=True)
1 Answer
This is not too bad. It's good that you use keyword arguments for the replace method.
I always try to keep my original data in its original state, and continue with the cleaned dataframe.
fluent style
This lends itself very well to a kind of fluent style as in this example. I use it too, and use a lot of df.assign, df.pipe, df.query, ...
In this example I would do something like:
df_cleaned = df.assign(
    col2=(
        df["col2"]
        .str.lower()
        .str.strip()
        .replace(to_replace="[^a-zA-Z ]", value="", regex=True)
    )
)
So definitely no inplace replacements.
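If the same cleaning is needed in several places, the chain can be wrapped in a small function and applied with df.pipe. The following is only a sketch of that idea: the helper name clean_text_column and its column parameter are made up for illustration, and the cleaning steps are copied unchanged from the question.

```python
import pandas as pd


def clean_text_column(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a new frame with `column` cleaned; `frame` itself is not modified.

    Hypothetical helper, named here only for illustration.
    """
    cleaned = (
        frame[column]
        .str.lower()
        .str.strip()
        .replace(to_replace="[^a-zA-Z ]", value="", regex=True)
    )
    # assign returns a new DataFrame, so the original data stays as it was
    return frame.assign(**{column: cleaned})


df = pd.DataFrame(
    {"col1": [0, 1, 2, 3], "col2": ["abcd efg", ".%ues", "t12 ^&3", "yupe"]}
)

# pipe passes df as the first argument of the helper
df_cleaned = df.pipe(clean_text_column, column="col2")
```

One thing to watch: because .str.strip() runs before the regex replacement, "t12 ^&3" ends up as "t " with a trailing space. Moving .str.strip() to the end of the chain would remove it, if that is what you want.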