I'm looking for an optimal way to join two DataFrames, keeping only the records whose id exists in both. An example would be:
dat1 = pd.DataFrame({"id": [1,2,3,4,5], "dat1": [34,56,57,45,23]})
dat2 = pd.DataFrame({"id": [2,3,4], "dat2": [19,19,20]})
dat1
   id  dat1
0   1    34
1   2    56
2   3    57
3   4    45
4   5    23
dat2
   id  dat2
0   2    19
1   3    19
2   4    20
With the aim being to create overall_data:
overall_data
   id  dat1  dat2
0   2    56    19
1   3    57    19
2   4    45    20
At the moment my method is:
dat3 = dat1['id'].isin(dat2['id'])
dat3 = pd.DataFrame(dat3)
dat3.columns = ['bool']
dat4 = dat1.join(dat3)
overall_data = dat4[dat4['bool'] == True]
Though this feels very messy. Is there a nicer way to do this?
1 Answer
This is the textbook example of an inner join. The most canonical way to have your id
columns used for the matching is to set them as the index first (here using inplace
operations to save on extra variable names; depending on your use, you might prefer new copies instead):
dat1.set_index('id', inplace=True)
dat2.set_index('id', inplace=True)
Then, the whole operation becomes this simple join on index:
>>> overall_data = dat1.join(dat2, how='inner')
>>> overall_data
    dat1  dat2
id
2     56    19
3     57    19
4     45    20
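The note above about preferring new copies can be sketched as follows. This is only a minimal illustration using the sample data from the question: calling set_index without inplace=True returns new DataFrames, so dat1 and dat2 keep id as an ordinary column. The final reset_index() call is an optional extra step, assumed here only for the case where id should end up as a regular column again.

```python
import pandas as pd

# Sample data copied from the question
dat1 = pd.DataFrame({"id": [1, 2, 3, 4, 5], "dat1": [34, 56, 57, 45, 23]})
dat2 = pd.DataFrame({"id": [2, 3, 4], "dat2": [19, 19, 20]})

# set_index without inplace=True returns new frames;
# the originals are left unchanged
overall_data = dat1.set_index("id").join(dat2.set_index("id"), how="inner")

# Optional: turn the id index back into an ordinary column
overall_flat = overall_data.reset_index()

print(overall_data)
print(overall_flat)
```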
If you do not want to modify the original DataFrames, you can use the utility function merge instead, which performs the same operation but takes the common column name explicitly:
>>> pd.merge(dat1, dat2, how='inner', on='id')
   id  dat1  dat2
0   2    56    19
1   3    57    19
2   4    45    20
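For completeness, here is a self-contained sketch of the merge variant, starting from the original frames with id as an ordinary column. The validate="one_to_one" argument is not part of the answer above; it is an optional addition, assumed here to be useful if each id is expected to appear at most once per frame, in which case pandas raises a MergeError on duplicates instead of silently multiplying rows.

```python
import pandas as pd

# Sample data copied from the question
dat1 = pd.DataFrame({"id": [1, 2, 3, 4, 5], "dat1": [34, 56, 57, 45, 23]})
dat2 = pd.DataFrame({"id": [2, 3, 4], "dat2": [19, 19, 20]})

# Inner merge keeps only the ids present in both frames.
# validate="one_to_one" (optional) checks that id is unique on both sides.
overall_data = pd.merge(dat1, dat2, how="inner", on="id", validate="one_to_one")

print(overall_data)
```

The inputs are left untouched, so dat1 and dat2 can still be used afterwards in their original shape.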