I have implemented a custom merge of two pandas DataFrames, but it only mutates the first. The new columns are the union of the original columns, and the new keys are the union of the original keys. Additionally, where keys match, values from the second DataFrame overwrite those in the first.
def merge_dataframes(df1, df2):
    for header in set(df2) - (set(df1) & set(df2)):
        df1[header] = None
    old_keys = df2.index.isin(df1.index)
    new_keys = ~old_keys
    df1.update(df2[old_keys])
    df1 = df1.append(df2[new_keys], sort=True)
    return df1
This is tested as follows:
import pandas

df1 = pandas.DataFrame(data=["Test1", "Test2"])
df2 = pandas.DataFrame(data=["Test3", "Test4"])
# this should simply overwrite df1 with df2
new_df = merge_dataframes(df1, df2)
assert (new_df == df2).all()[0]
df1 = pandas.DataFrame(data=["Test1", "Test2"])
df2 = pandas.DataFrame(data=["Test3", "Test4"], index=["a", "b"])
# this should simply overwrite df1 with df2
new_df = merge_dataframes(df1, df2)
expected_df = pandas.DataFrame(data=["Test1", "Test2", "Test3", "Test4"], index=[0, 1, "a", "b"])
assert (new_df == expected_df).all()[0]
Is there a way to accomplish this with less code?
1 Answer
Your terminology should be modified a little to conform better to the Pandas documentation: "keys" are really just the index.
Your function has a confused idea of in-place versus out-of-place. It mutates the input but also returns something, suggesting that it's being done out-of-place. Choose one or the other (prefer out-of-place).
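To see the split, run the first test case above: update() mutates the caller's df1 in place, while append() builds a brand-new frame that only the return value sees. A quick check, reusing the frames from that test:

df1 = pandas.DataFrame(data=["Test1", "Test2"])
df2 = pandas.DataFrame(data=["Test3", "Test4"])
result = merge_dataframes(df1, df2)
print(df1[0].tolist())  # ['Test3', 'Test4'] - the caller's frame was mutated
print(result is df1)    # False - yet the returned frame is a different object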
set(df2) - (set(df1) & set(df2)) is just set(df2) - set(df1). But this is confusing: you should explicitly refer to .columns rather than relying on it implicitly.
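For example, the column-adding loop could be written more explicitly with Index.difference, which sidesteps the set arithmetic entirely (a sketch):

for header in df2.columns.difference(df1.columns):
    df1[header] = None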
DataFrame.append() has been deprecated. Instead, interpreted literally, you would use something like

df1 = pd.concat(
    (df1, df2[new_keys]),
    axis='rows', sort=True,
)
which does in fact get your tests to pass. However, this isn't particularly the approach you should take. I don't trust set subtraction to work on a frame having a multi-level column index, for one; I also don't trust that your implementation preserves dtypes.
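The dtype concern is easy to demonstrate. A minimal sketch with hypothetical frames (exact behaviour varies by pandas version):

df1 = pd.DataFrame({'x': [1, 2]})           # x is int64
df2 = pd.DataFrame({'x': [10]}, index=[0])
df1.update(df2)
print(df1['x'].dtype)  # float64 - the integer dtype is silently lost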
I started out with this horrible kludge:
def merge_dataframes(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    dest, df2_aligned = df1.align(df2)
    cols1 = set(df1.columns)
    cols2 = set(df2_aligned.columns)
    new_cols = list(cols2 - cols1)
    shared_cols = list(cols2 & cols1)
    if len(new_cols) > 0:
        dest[new_cols] = df2_aligned[new_cols]
    if len(shared_cols) > 0:
        dest.loc[df2.index, shared_cols] = df2[shared_cols]
    return dest
because it's out-of-place, and I had a difficult time convincing the various incantations of merge, concat and join to do what you want. It passes your tests, but you shouldn't use it, either.
@anky's first suggestion doesn't work. Whereas their second one,

pd.concat((df1, df2)).groupby(level=0).last()

does, I somewhat doubt that it will behave sanely for multi-level indices, so I will not in this review suggest that you use it. A safer version is
def merge_dataframes(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    combined = pd.concat((df1, df2))
    return combined.groupby(
        level=list(range(combined.index.nlevels)),
    ).last()
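As a quick illustration of the overwrite semantics, with hypothetical frames and assuming pandas is imported as pd:

df1 = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
df2 = pd.DataFrame({'x': [30], 'y': [40]}, index=['b'])
print(merge_dataframes(df1, df2))
#     x     y
# a   1   NaN
# b  30  40.0

Row b takes df2's values, row a survives from df1, and the new column y is filled with NaN where df2 had nothing to say.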
Broadly, I have my misgivings about this function existing at all. pd.merge() makes more explicit what you're joining on and how to treat duplicates, and I fear that a blind reliance on this function would lead to confusing bugs from data silently being coerced or obliterated.
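For comparison, here is a sketch of an explicit merge, with hypothetical frames and column names; the join key and the treatment of overlapping columns are both spelled out, so nothing is overwritten silently:

left = pd.DataFrame({'key': [1, 2], 'value': ['a', 'b']})
right = pd.DataFrame({'key': [2, 3], 'value': ['B', 'c']})
merged = pd.merge(left, right, on='key', how='outer', suffixes=('_left', '_right'))
#    key value_left value_right
# 0    1          a         NaN
# 1    2          b           B
# 2    3        NaN           c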
@anky: df1 = pd.concat((df1, df2)).drop_duplicates(keep='last'). Let me know if I am wrong; also, if you can extend the example, that'd be great.
@anky: df1 = pd.concat((df1, df2)).groupby(level=0).last()