I have two dataframes. The first contains names and years:
| name | year |
| ---- | ---- |
| ram  | 1873 |
| rob  | 1900 |
The second contains names and texts:
| name | text |
| ---- | ---- |
| ram  | A good kid |
| ram  | He was born on 1873 |
| rob  | He is tall |
| rob  | He is 12 yrs old |
| rob  | His father died at 1900 |
I want to find the indices of the rows in the second dataframe where the name matches a name in the first dataframe and the text contains that name's year from the first dataframe.
The result should be the indices 1 and 4.
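For reference, the two frames can be built like this (a minimal sketch; A and B are the names used in the code below, and the year is stored as a string so the substring check in the loop works):

import pandas as pd

# first dataframe: names and years (A in the code below)
A = pd.DataFrame({"name": ["ram", "rob"], "year": ["1873", "1900"]})

# second dataframe: names and free text (B in the code below)
B = pd.DataFrame({
    "name": ["ram", "ram", "rob", "rob", "rob"],
    "text": ["A good kid", "He was born on 1873", "He is tall",
             "He is 12 yrs old", "His father died at 1900"],
})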
My Code:
ind_list = []
for ind1, old in enumerate(A.name):
    for ind2, new in enumerate(B.name):
        if A.name[ind1] == B.name[ind2]:
            if A.year[ind1] in B.text[ind2]:
                ind_list.append(ind2)
Is there a better way to write the above code?
1 Answer
Here is what we start with.
In [16]: df1
Out[16]:
  name  year
0  ram  1873
1  rob  1900

In [17]: df2
Out[17]:
  name                     text
0  ram               A good kid
1  ram      He was born on 1873
2  rob               He is tall
3  rob         He is 12 yrs old
4  rob  His father died at 1900
What you probably want to do is merge your two DataFrames. If you're familiar with SQL, this is just like a table join. The pd.merge step essentially "adds" the columns from df1 to df2 by checking where the two DataFrames match on the column "name". Then, once you have the columns you want ("year" and "text") matched according to the "name" column, we apply the function lambda x: str(x.year) in x.text (which checks if the year is present in the text) across the rows (axis=1).
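To make the merge step concrete, the intermediate result of the merge alone (before the .apply call below) should look roughly like this; on="name" here is equivalent to the left_on/right_on pair used below:

merged = pd.merge(left=df2, right=df1, how="left", on="name")
# merged attaches the matching year to every row of df2:
#   name                     text  year
# 0  ram               A good kid  1873
# 1  ram      He was born on 1873  1873
# 2  rob               He is tall  1900
# 3  rob         He is 12 yrs old  1900
# 4  rob  His father died at 1900  1900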
In [18]: cond = pd.merge(
...: left=df2,
...: right=df1,
...: how="left",
...: left_on="name",
...: right_on="name",
...: ).apply(lambda x: str(x.year) in x.text, axis=1)
This gives us a Series which has the same index as your second DataFrame, and contains boolean values telling you if your desired condition is met or not.
In [19]: cond
Out[19]:
0 False
1 True
2 False
3 False
4 True
dtype: bool
Then, we filter the Series to the rows where the condition is true and take the index, optionally converting it to a list.
In [20]: cond[cond].index
Out[20]: Int64Index([1, 4], dtype='int64')
In [21]: cond[cond].index.tolist()
Out[21]: [1, 4]
If all you need later on is to iterate over the indices you've gotten, In [18] and In [20] will suffice.
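Putting the pieces together, a self-contained version of this approach might look like the following sketch (the frame and column names match the example above):

import pandas as pd

df1 = pd.DataFrame({"name": ["ram", "rob"], "year": [1873, 1900]})
df2 = pd.DataFrame({
    "name": ["ram", "ram", "rob", "rob", "rob"],
    "text": ["A good kid", "He was born on 1873", "He is tall",
             "He is 12 yrs old", "His father died at 1900"],
})

# attach the year to each row of df2, then test whether it appears in the text
cond = (
    pd.merge(left=df2, right=df1, how="left", on="name")
    .apply(lambda x: str(x.year) in x.text, axis=1)
)

matching_indices = cond[cond].index.tolist()
print(matching_indices)  # [1, 4]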
- Thanks, this is good. But if I apply it to different dataframes, about a third of the rows of cond are NaN. What could be the possible reason? – DGS, Aug 12, 2019 at 14:24
- I can't know for sure, but if, for example, df2 contains names not present in df1, then those rows will probably get filled in with NaN during the join/merge. Depending on your use case, you might treat those cases differently, but if I interpret your question strictly, then replacing str(x.year) in x.text with (str(int(x.year)) in x.text) if not pd.isnull(x.year) else False would probably be the best way to go (converting the NaN to False so they don't appear in the final list of indices). – Andrew, Aug 13, 2019 at 0:31
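Spelled out, the NaN-safe condition suggested in this comment would look something like the following sketch (assuming the same merge as above; the year column becomes float once NaN is present, hence the int() conversion):

# rows of df2 whose name has no match in df1 end up with year = NaN after the merge
cond = pd.merge(left=df2, right=df1, how="left", on="name").apply(
    lambda x: (str(int(x.year)) in x.text) if not pd.isnull(x.year) else False,
    axis=1,
)

matching_indices = cond[cond].index.tolist()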