I have a pandas dataframe, df1:
id level val
1 l1 0.2
1 l2 0.1
and another dataframe which contains type for l1 level, df2:
id type
1 A
Similarly, df3 which contains type for l2 level as well as new ids:
id new_id type
1 19 A
1 19 B
1 20 A
1 20 B
Now I want type for l1 level from df2. And for l2 level from df3, while also resetting ids and level. This is what I am doing:
df2['level'] = 'l1'
df3['level'] = 'l2'
df1 = pd.merge(df1, df2, how='outer', on=['id','level'])
df1 = pd.merge(df1, df3, how='outer', on=['id','level'], suffixes=["_", ""])
df1['id'] = np.where(df1['new_id'].isnull(), df1['id'], df1['new_id'])
df1['type'] = np.where(df1['type'].isnull(), df1['type_'], df1['type'])
df1['level'] = np.where(df1['level']=='l2', 'l3', df1['level'])
df1 = df1.drop('new_id', 1)
df1 = df1.drop('type_', 1)
Which finally gives me, df1:
id level val type
1 l1 0.2 A
19 l3 0.1 A
19 l3 0.1 B
20 l3 0.1 A
20 l3 0.1 B
Since df1 is a large dataframe, I want to avoid using np.where thrice, i.e., I am looking for a faster approach to get this. Thanks!
1 Answer 1
keyword arguments
I had to check the documentation to find out what the 1 means in df1.drop('new_id', 1). More clear would be df1.drop('new_id', axis=1), or even better: df1.drop(columns=["type_", "new_id"])
Don't overwrite raw data
You add columns and values to df1, which makes finding problems harder.
With a bit of reshuffling, and using df.join instead of pd.merge, you can make your intent a bit more clear
df6= df1.join(
df2.assign(level="l1").set_index(["id", "level"]),
how="outer",
on=["id", "level"],
).join(
df3.assign(level="l2").set_index(["id", "level"]),
how="outer",
on=["id", "level"],
lsuffix="_",
)
np.where
If you're looking for null, you can use fillna or combine_first. To replace 'l2' with 'l3', you can use Series.replace.
df7 = df6.assign(
id=df6["new_id"].fillna(df6["id"]).astype(int),
type=df6["type_"].fillna(df6["type"]),
level=df6["level"].replace("l2", "l3")
).drop(columns=["type_", "new_id"])
I don't know whether this approach will be faster, but to me, what happens here is a lot more clear.