Is the following an acceptable way to change an object column to numeric when a cardinality threshold (cardinality_threshold) is breached? I think it would be good to check whether the column values are actually numeric, though. Thanks!
import pandas as pd

data = {
    'col1': ['1', '2', '3'],
    'col2': ['1', '1', '2'],
}
df = pd.DataFrame(data)
df

cardinality_threshold = 2
cols = [i for i in df.columns]
for i in cols:
    if len(df[i].value_counts()) > cardinality_threshold:
        df[i] = pd.to_numeric(df[i])  # , errors='coerce'
print(df.info())
1 Answer
- Use DataFrame.nunique to vectorize the cardinality check
- Use DataFrame.apply to convert with to_numeric (not vectorized, but more idiomatic than looping)
- Use uppercase CARDINALITY_THRESHOLD per PEP 8 style for constants
CARDINALITY_THRESHOLD = 2
breached = df.columns[df.nunique() > CARDINALITY_THRESHOLD]
df[breached] = df[breached].apply(pd.to_numeric, errors='coerce')
>>> df.dtypes
# col1 int64
# col2 object
# dtype: object
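To see why only col1 was converted, you can peek at the intermediate steps; on the sample data I'd expect something like:

>>> df.nunique()
# col1    3
# col2    2
# dtype: int64

>>> df.nunique() > CARDINALITY_THRESHOLD
# col1     True
# col2    False
# dtype: bool

>>> breached
# Index(['col1'], dtype='object')

col1 has 3 unique values, which breaches the threshold of 2, while col2 has only 2 and is left alone.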
"I think it would be good to check if column values are numeric though."
Note that to_numeric already skips numeric columns, so it's simplest to just let pandas handle it.
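For example, calling it on a column that is already numeric should simply come back numeric (a quick sketch using a throwaway Series):

>>> s = pd.Series([1, 2, 3])   # already int64
>>> pd.to_numeric(s).dtype     # unchanged, nothing to parse
# dtype('int64')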
If you still want to explicitly exclude numeric columns:
- Use DataFrame.select_dtypes to get the non-numeric columns
- Use Index.intersection to get the non-numeric breached columns
breached = df.columns[df.nunique() > CARDINALITY_THRESHOLD]
non_numeric = df.select_dtypes(exclude='number').columns
cols = non_numeric.intersection(breached)
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
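On the sample data this should end up in the same place as the shorter version above, since both string columns are non-numeric but only col1 breaches the threshold:

>>> cols
# Index(['col1'], dtype='object')
>>> df.dtypes
# col1 int64
# col2 object
# dtype: object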