I have two dataframes. The first contains names and years:
| name | year |
| ---- | ---- |
| ram  | 1873 |
| rob  | 1900 |
The second contains names and texts:
| name | text |
| ---- | ---- |
| ram  | A good kid |
| ram  | He was born on 1873 |
| rob  | He is tall |
| rob  | He is 12 yrs old |
| rob  | His father died at 1900 |
I want to find the indices of the rows in the second dataframe where the name matches a name in the first dataframe and the text contains that name's year from the first dataframe.
The result should be the indices 1 and 4.
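For reference, the two frames can be built like this (a minimal sketch; A and B are the names used in the code below, and the year is stored as a string so the substring check in the loop works):

import pandas as pd

# first dataframe: names and years (A in the code below)
A = pd.DataFrame({"name": ["ram", "rob"], "year": ["1873", "1900"]})

# second dataframe: names and free text (B in the code below)
B = pd.DataFrame({
    "name": ["ram", "ram", "rob", "rob", "rob"],
    "text": ["A good kid", "He was born on 1873", "He is tall",
             "He is 12 yrs old", "His father died at 1900"],
})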
My Code:
ind_list = []
for ind1, old in enumerate(A.name):
    for ind2, new in enumerate(B.name):
        if A.name[ind1] == B.name[ind2]:
            if A.year[ind1] in B.text[ind2]:
                ind_list.append(ind2)
Is there a better way to write the above code?
1 Answer
Here is what we start with.
In [16]: df1
Out[16]:
  name  year
0  ram  1873
1  rob  1900

In [17]: df2
Out[17]:
  name                     text
0  ram               A good kid
1  ram      He was born on 1873
2  rob               He is tall
3  rob         He is 12 yrs old
4  rob  His father died at 1900
What you probably want to do is merge your two DataFrames. If you're familiar with SQL, this is just like a table join. The pd.merge step essentially "adds" the columns from df1 to df2 by checking where the two DataFrames match on the column "name". Then, once you have the columns you want ("year" and "text") matched according to the "name" column, we apply the function lambda x: str(x.year) in x.text (which checks if the year is present in the text) across the rows (axis=1).
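To make the merge step concrete, the intermediate result of the merge alone (before the .apply call below) should look roughly like this; on="name" here is equivalent to the left_on/right_on pair used below:

merged = pd.merge(left=df2, right=df1, how="left", on="name")
# merged attaches the matching year to every row of df2:
#   name                     text  year
# 0  ram               A good kid  1873
# 1  ram      He was born on 1873  1873
# 2  rob               He is tall  1900
# 3  rob         He is 12 yrs old  1900
# 4  rob  His father died at 1900  1900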
In [18]: cond = pd.merge(
...: left=df2,
...: right=df1,
...: how="left",
...: left_on="name",
...: right_on="name",
...: ).apply(lambda x: str(x.year) in x.text, axis=1)
This gives us a Series which has the same index as your second DataFrame, and contains boolean values telling you if your desired condition is met or not.
In [19]: cond
Out[19]:
0 False
1 True
2 False
3 False
4 True
dtype: bool
Then, we filter the Series to the rows where the condition is true and take the index, optionally converting it to a list.
In [20]: cond[cond].index
Out[20]: Int64Index([1, 4], dtype='int64')
In [21]: cond[cond].index.tolist()
Out[21]: [1, 4]
If all you need later on is to iterate over the indices you've gotten, In [18] and In [20] will suffice.
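Putting the pieces together, a self-contained version of this approach might look like the following sketch (the frame and column names match the example above):

import pandas as pd

df1 = pd.DataFrame({"name": ["ram", "rob"], "year": [1873, 1900]})
df2 = pd.DataFrame({
    "name": ["ram", "ram", "rob", "rob", "rob"],
    "text": ["A good kid", "He was born on 1873", "He is tall",
             "He is 12 yrs old", "His father died at 1900"],
})

# attach the year to each row of df2, then test whether it appears in the text
cond = (
    pd.merge(left=df2, right=df1, how="left", on="name")
    .apply(lambda x: str(x.year) in x.text, axis=1)
)

matching_indices = cond[cond].index.tolist()
print(matching_indices)  # [1, 4]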
- Thanks, this is good. But if I apply it to different dataframes, about a third of the rows of cond are NaN. What could be the possible reason? – DGS, Aug 12, 2019 at 14:24
- I can't know for sure, but if, for example, df2 contains names not present in df1, then those rows will probably get filled in with NaN during the join/merge. Depending on your use case, you might treat those cases differently, but if I interpret your question strictly, then replacing str(x.year) in x.text with (str(int(x.year)) in x.text) if not pd.isnull(x.year) else False would probably be the best way to go (converting the NaN to False so they don't appear in the final list of indices). – Andrew, Aug 13, 2019 at 0:31
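Spelled out, the NaN-safe condition suggested in this comment would look something like the following sketch (assuming the same merge as above; the year column becomes float once NaN is present, hence the int() conversion):

# rows of df2 whose name has no match in df1 end up with year = NaN after the merge
cond = pd.merge(left=df2, right=df1, how="left", on="name").apply(
    lambda x: (str(int(x.year)) in x.text) if not pd.isnull(x.year) else False,
    axis=1,
)

matching_indices = cond[cond].index.tolist()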