I have datasetA with 90,000 rows and datasetB with 5,000 rows. Each dataset has a column called "ID" with employee IDs. My goal is to create another column in datasetA that identifies, as True/False, whether the employee ID in datasetA is also in datasetB. Additionally, some employee IDs most likely appear more than once in both datasets. I am fairly certain that the code I wrote works, but it is way too slow. What could I change to make it faster? Thanks!
```python
# Creating the new column to identify whether the ID in datasetA is also in datasetB.
datasetA["inB"] = "Empty"

# Looping through
for id_num in datasetA["ID"]:
    filt = (datasetA["ID"] == id_num)
    if (datasetB["ID"] == id_num).any():
        datasetA.loc[filt, "inB"] = True
    else:
        datasetA.loc[filt, "inB"] = False
```
You can do that with an inner join. pandas.pydata.org/pandas-docs/stable/reference/api/… – Tweakimp, Feb 13, 2021 at 22:17
1 Answer
Is this what you want?
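```python
import pandas as pd

datasetA = pd.DataFrame(
    [['ID222'], ['ID233'], ['ID2123'], ['ID233']],
    columns=['ID'],
)
datasetB = pd.DataFrame(
    [['ID222'], ['ID233'], ['ID212355'], ['ID233']],
    columns=['ID'],
)

# Vectorized membership test: True where the ID in datasetA also appears in datasetB.
datasetA["inB"] = datasetA.ID.isin(datasetB.ID)

# Display each (ID, inB) pair once; this returns a new frame and leaves datasetA unchanged.
datasetA.drop_duplicates()
```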
```
       ID    inB
0   ID222   True
1   ID233   True
2  ID2123  False
```
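If you would rather use a join, as the comment on the question suggests, here is a minimal sketch of one way that could look. Note that it uses a left join with `indicator=True` rather than an inner join, since an inner join would drop the rows of datasetA that have no match instead of flagging them. The sample data and the name `b_ids` are only illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample data, including repeated IDs in both frames.
datasetA = pd.DataFrame({"ID": ["ID222", "ID233", "ID2123", "ID233"]})
datasetB = pd.DataFrame({"ID": ["ID222", "ID233", "ID212355", "ID233"]})

# De-duplicate B's IDs first; otherwise the join would repeat each row of A
# once per matching row in B.
b_ids = datasetB[["ID"]].drop_duplicates()

# The left join keeps every row of A. indicator=True adds a "_merge" column
# that is "both" when the ID was found in B and "left_only" otherwise.
merged = datasetA.merge(b_ids, on="ID", how="left", indicator=True)
merged["inB"] = merged["_merge"] == "both"
merged = merged.drop(columns="_merge")

# Cross-check against the isin approach.
expected = datasetA["ID"].isin(datasetB["ID"])
assert merged["inB"].tolist() == expected.tolist()
```

For a plain True/False flag, `isin` is probably the simpler choice. The join may be worth it if you later want to bring other columns over from datasetB as well.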