1
\$\begingroup\$

I have datasetA with 90,000 rows and datasetB with 5,000 rows. Each dataset has a column called "ID" containing employee IDs. My goal is to create another column in datasetA that flags, with True/False, whether each employee ID in datasetA also appears in datasetB. Some employee IDs most likely appear more than once in both datasets. I am fairly certain that the code I wrote works, but it is far too slow. What could I change to make it faster? Thanks!

```python
# Create the new column that flags whether each ID in datasetA is also in datasetB.
datasetA["inB"] = "Empty"
# Loop through every ID in datasetA
for id_num in datasetA["ID"]:
    filt = (datasetA["ID"] == id_num)
    if (datasetB["ID"] == id_num).any():
        datasetA.loc[filt, "inB"] = True
    else:
        datasetA.loc[filt, "inB"] = False
```
asked Feb 13, 2021 at 20:09
\$\endgroup\$

1 Answer

2
\$\begingroup\$

Is this what you want?

```python
import pandas as pd

datasetA = pd.DataFrame({'ID': ['ID222', 'ID233', 'ID2123', 'ID233']})
datasetB = pd.DataFrame({'ID': ['ID222', 'ID233', 'ID212355', 'ID233']})

# Vectorized membership test: True where the ID in datasetA also appears in datasetB
datasetA["inB"] = datasetA.ID.isin(datasetB.ID)

# Display the result with duplicate rows dropped (datasetA itself is not modified)
datasetA.drop_duplicates()
#        ID    inB
# 0   ID222   True
# 1   ID233   True
# 2  ID2123  False
```
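
To get a feel for how this behaves at the sizes in the question, here is a rough sketch on synthetic data. The random integer IDs, the ID range, and the seed are assumptions made up for illustration, not the real employee data. It builds frames of 90,000 and 5,000 rows, fills the column with a single `isin` call, and cross-checks a small sample against the per-ID comparison used in the original loop.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the real data (assumption: integer IDs with repeats)
rng = np.random.default_rng(0)
datasetA = pd.DataFrame({"ID": rng.integers(0, 20_000, size=90_000)})
datasetB = pd.DataFrame({"ID": rng.integers(0, 20_000, size=5_000)})

# One vectorized call replaces the whole Python-level loop.
# Duplicate IDs in either frame are handled without extra work.
datasetA["inB"] = datasetA["ID"].isin(datasetB["ID"])

# Cross-check a small sample against the comparison from the original loop
sample = datasetA.head(200)
expected = [bool((datasetB["ID"] == id_num).any()) for id_num in sample["ID"]]
assert sample["inB"].tolist() == expected
```

Note that `drop_duplicates()` is only needed if you want one row per ID. If every row of datasetA should keep its flag, the `isin` line on its own is enough.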
answered Feb 13, 2021 at 23:05
\$\endgroup\$
