I have a pandas dataframe, df1:

id  level  val
1   l1     0.2
1   l2     0.1
and another dataframe, df2, which contains the type for the l1 level:

id  type
1   A
Similarly, df3, which contains the type for the l2 level as well as new ids:

id  new_id  type
1   19      A
1   19      B
1   20      A
1   20      B
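For reference, the three frames can be rebuilt from the tables above like this (imports included, since the code below uses both pd and np):

import pandas as pd
import numpy as np

df1 = pd.DataFrame({"id": [1, 1], "level": ["l1", "l2"], "val": [0.2, 0.1]})
df2 = pd.DataFrame({"id": [1], "type": ["A"]})
df3 = pd.DataFrame(
    {"id": [1, 1, 1, 1], "new_id": [19, 19, 20, 20], "type": ["A", "B", "A", "B"]}
)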
Now I want the type for the l1 level to come from df2, and for the l2 level from df3, while also replacing the ids with the new ids and renaming the l2 level to l3. This is what I am doing:
df2['level'] = 'l1'
df3['level'] = 'l2'
df1 = pd.merge(df1, df2, how='outer', on=['id','level'])
df1 = pd.merge(df1, df3, how='outer', on=['id','level'], suffixes=["_", ""])
df1['id'] = np.where(df1['new_id'].isnull(), df1['id'], df1['new_id'])
df1['type'] = np.where(df1['type'].isnull(), df1['type_'], df1['type'])
df1['level'] = np.where(df1['level']=='l2', 'l3', df1['level'])
df1 = df1.drop('new_id', 1)
df1 = df1.drop('type_', 1)
This finally gives me df1:

id  level  val  type
1   l1     0.2  A
19  l3     0.1  A
19  l3     0.1  B
20  l3     0.1  A
20  l3     0.1  B
Since df1 is a large dataframe, I want to avoid calling np.where three times, i.e. I am looking for a faster way to get this result. Thanks!
1 Answer
keyword arguments
I had to check the documentation to find out what the 1 means in df1.drop('new_id', 1). Clearer would be df1.drop('new_id', axis=1), or even better: df1.drop(columns=["type_", "new_id"]).
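As a side note, if I remember correctly pandas 2.0 made everything after labels keyword-only in drop, so the positional form stops working entirely:

# positional axis: works in older pandas, raises a TypeError in pandas >= 2.0
df1 = df1.drop('new_id', 1)

# keyword-only, and both columns can be dropped in one call
df1 = df1.drop(columns=["new_id", "type_"])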
Don't overwrite raw data
You add columns and values to df1, which makes finding problems harder. With a bit of reshuffling, and using df.join instead of pd.merge, you can make your intent a bit clearer:
df6 = df1.join(
    df2.assign(level="l1").set_index(["id", "level"]),
    how="outer",
    on=["id", "level"],
).join(
    df3.assign(level="l2").set_index(["id", "level"]),
    how="outer",
    on=["id", "level"],
    lsuffix="_",
)
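For the sample data, df6 should then look roughly like this: the l1 row has no new_id, and the suffixed type_ column holds the value that came from df2:

id  level  val  type_  new_id  type
1   l1     0.2  A      NaN     NaN
1   l2     0.1  NaN    19.0    A
1   l2     0.1  NaN    19.0    B
1   l2     0.1  NaN    20.0    A
1   l2     0.1  NaN    20.0    B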
np.where
If you're looking for nulls, you can use fillna or combine_first. To replace 'l2' with 'l3', you can use Series.replace.
df7 = df6.assign(
    id=df6["new_id"].fillna(df6["id"]).astype(int),
    type=df6["type_"].fillna(df6["type"]),
    level=df6["level"].replace("l2", "l3"),
).drop(columns=["type_", "new_id"])
I don't know whether this approach will be faster, but to me, what happens here is a lot clearer.
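If speed is the real concern, the only way to know is to measure both versions on realistically sized data. A rough sketch with timeit, where original_version and join_version are hypothetical wrappers around the two approaches:

import timeit

# original_version and join_version are hypothetical functions wrapping
# the np.where approach and the join approach; the original mutates its
# inputs, so pass copies to keep runs comparable
print(timeit.timeit(lambda: original_version(df1.copy(), df2.copy(), df3.copy()), number=100))
print(timeit.timeit(lambda: join_version(df1, df2, df3), number=100))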