I have a dataframe with three columns of text blobs to search through (the data can be found in this Gist), and three keywords that I want to identify in this text:
branches_of_sci = ['bio', 'chem', 'physics']
I've written the following code to identify the presence of these keywords:
dfq_col = ['Text A', 'Text B', 'Text C']
for branch in branches_of_sci:
    for col in dfq_col:
        temp_list = []
        for row in df[col]:
            if type(row) is not str:
                temp_list.append(False)
            elif type(row) is str:
                temp_list.append(row.find(branch)>0)
        df[branch] |= temp_list
This is the result of the data I linked to:
I think the main problem here is that I'm using a for-loop when I should be using some sort of dataframe-specific function, but I'm not sure how to restructure the code to accomplish this.
1 Answer
import pandas as pd
df = pd.read_clipboard(sep=',') # copied data from the gist
branches_of_sci = ['bio', 'chem', 'physics']
for branch in branches_of_sci:
    df[branch] = df.astype(str).sum(axis=1).str.contains(branch)
In my limited experience, for loops are almost always wrong when using Pandas. The primary benefit of Pandas is vectorization, so using the built-in methods is typically best.
Here is a breakdown of the main function:
df[branch]: creates a new dataframe column
df.astype(str): converts all of the dtypes in the dataframe to strings
.sum(axis=1): concatenates all dataframe columns horizontally (i.e. axis=1)
.str.contains(): use built-in string search (see docs)
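To make the steps above concrete, here is a minimal sketch that runs them one at a time on a tiny made-up dataframe. The sample values and the names text_cols, as_text and combined are assumptions for illustration only and are not taken from the Gist. The sketch also departs from the one-liner in two small ways: it restricts the search to the three text columns, and it fills missing cells with an empty string first so that the text "nan" does not end up in the searched string.

```python
import pandas as pd

# Hypothetical sample data: the column names follow the question,
# but the values are invented for illustration.
df = pd.DataFrame({
    "Text A": ["biology lab notes", None, "nothing relevant"],
    "Text B": ["organic chemistry", "astrophysics seminar", None],
    "Text C": [None, "chem homework", "grocery list"],
})
text_cols = ["Text A", "Text B", "Text C"]  # assumed column list
branches_of_sci = ["bio", "chem", "physics"]

# Step 1: replace missing cells with "" and make every cell a string.
as_text = df[text_cols].fillna("").astype(str)

# Step 2: summing strings along axis=1 joins each row into one string.
combined = as_text.sum(axis=1)

# Step 3: one vectorized substring test per keyword
# (regex=False treats the keyword as plain text, not a pattern).
for branch in branches_of_sci:
    df[branch] = combined.str.contains(branch, regex=False)

print(df[branches_of_sci])
```

Two things are worth keeping in mind with this approach. Because the columns are joined without a separator, a keyword could in principle match across the boundary between two columns. Also, row.find(branch)>0 in the question misses a keyword that sits at the very start of a cell, since find returns 0 in that case, whereas str.contains reports it.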
Hopefully that helps.