I have implemented a custom merge of two pandas DataFrames, but it only mutates the first. The new columns are the union of the original columns, and the new keys are the union of the original keys. Additionally, where keys match, values from the second DataFrame overwrite those in the first.
def merge_dataframes(df1, df2):
    for header in set(df2) - (set(df1) & set(df2)):
        df1[header] = None
    old_keys = df2.index.isin(df1.index)
    new_keys = ~old_keys
    df1.update(df2[old_keys])
    df1 = df1.append(df2[new_keys], sort=True)
    return df1
This is tested as follows:
import pandas

df1 = pandas.DataFrame(data=["Test1", "Test2"])
df2 = pandas.DataFrame(data=["Test3", "Test4"])
# this should simply overwrite df1 with df2
new_df = merge_dataframes(df1, df2)
assert (new_df == df2).all()[0]
df1 = pandas.DataFrame(data=["Test1", "Test2"])
df2 = pandas.DataFrame(data=["Test3", "Test4"], index=["a", "b"])
# this should simply overwrite df1 with df2
new_df = merge_dataframes(df1, df2)
expected_df = pandas.DataFrame(data=["Test1", "Test2", "Test3", "Test4"], index=[0, 1, "a", "b"])
assert (new_df == expected_df).all()[0]
Is there a way to accomplish this with less code?
1 Answer
Your terminology should be modified a little to conform better to the Pandas documentation: "keys" are really just the index.
Your function has a confused idea of in-place versus out-of-place. It mutates the input but also returns something, suggesting that it's being done out-of-place. Choose one or the other (prefer out-of-place).
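To see the split, run the first test case above: update() mutates the caller's df1 in place, while append() builds a brand-new frame that only the return value sees. A quick check, reusing the frames from that test:

df1 = pandas.DataFrame(data=["Test1", "Test2"])
df2 = pandas.DataFrame(data=["Test3", "Test4"])
result = merge_dataframes(df1, df2)
print(df1[0].tolist())  # ['Test3', 'Test4'] - the caller's frame was mutated
print(result is df1)    # False - yet the returned frame is a different object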
set(df2) - (set(df1) & set(df2)) is just set(df2) - set(df1). But this is confusing: you should explicitly refer to .columns rather than relying on it implicitly.
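For example, the column-adding loop could be written more explicitly with Index.difference, which sidesteps the set arithmetic entirely (a sketch):

for header in df2.columns.difference(df1.columns):
    df1[header] = None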
DataFrame.append() has been deprecated. Instead, interpreted literally, you would use something like

df1 = pd.concat(
    (df1, df2[new_keys]),
    axis='rows', sort=True,
)
which does in fact get your tests to pass. However, this isn't particularly the approach you should take. I don't trust set subtraction to work on a frame having a multi-level column index, for one; I also don't trust that your implementation preserves dtypes.
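The dtype concern is easy to demonstrate. A minimal sketch with hypothetical frames (exact behaviour varies by pandas version):

df1 = pd.DataFrame({'x': [1, 2]})           # x is int64
df2 = pd.DataFrame({'x': [10]}, index=[0])
df1.update(df2)
print(df1['x'].dtype)  # float64 - the integer dtype is silently lost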
I started out with this horrible kludge:
def merge_dataframes(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    dest, df2_aligned = df1.align(df2)
    cols1 = set(df1.columns)
    cols2 = set(df2_aligned.columns)
    new_cols = list(cols2 - cols1)
    shared_cols = list(cols2 & cols1)
    if len(new_cols) > 0:
        dest[new_cols] = df2_aligned[new_cols]
    if len(shared_cols) > 0:
        dest.loc[df2.index, shared_cols] = df2[shared_cols]
    return dest
because it's out-of-place, and I had a difficult time convincing the various incantations of merge, concat and join to do what you want. It passes your tests, but you shouldn't use it, either.
@anky's first suggestion doesn't work. Whereas their second one,

pd.concat((df1, df2)).groupby(level=0).last()

does, I somewhat doubt that it will behave sanely for multi-level indices, so I will not in this review suggest that you use it. A safer version is
def merge_dataframes(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    combined = pd.concat((df1, df2))
    return combined.groupby(
        level=list(range(combined.index.nlevels)),
    ).last()
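As a quick illustration of the overwrite semantics, with hypothetical frames and assuming pandas is imported as pd:

df1 = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
df2 = pd.DataFrame({'x': [30], 'y': [40]}, index=['b'])
print(merge_dataframes(df1, df2))
#     x     y
# a   1   NaN
# b  30  40.0

Row b takes df2's values, row a survives from df1, and the new column y is filled with NaN where df2 had nothing to say.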
Broadly, I have my misgivings about this function existing at all. pd.merge() makes more explicit what you're joining on and how to treat duplicates, and I fear that a blind reliance on this function would lead to confusing bugs from data silently being coerced or obliterated.
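For comparison, here is a sketch of an explicit merge, with hypothetical frames and column names; the join key and the treatment of overlapping columns are both spelled out, so nothing is overwritten silently:

left = pd.DataFrame({'key': [1, 2], 'value': ['a', 'b']})
right = pd.DataFrame({'key': [2, 3], 'value': ['B', 'c']})
merged = pd.merge(left, right, on='key', how='outer', suffixes=('_left', '_right'))
#    key value_left value_right
# 0    1          a         NaN
# 1    2          b           B
# 2    3        NaN           c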
@anky: df1 = pd.concat((df1, df2)).drop_duplicates(keep='last'). Let me know if I am wrong; also, if you can extend the example, that'd be great.
@anky: df1 = pd.concat((df1, df2)).groupby(level=0).last()