Improving CSV filtering with Python using regex

Question 1

Consider a .csv file that contains a set of video names like so:

"There are happy days","1204923"
"Beware of ignorance","589636"
"Bloody Halls MV","258933"
"Dream Theater - As I Am - Live in...","89526"

The code I wrote filters the rows of the CSV against a list of excluded words: if a video's name contains any word in the list, the row is not saved. Here is the code:

exclude_list = ["mv","live","cover","remix","bootleg"]
data_set = []
with open('video_2013-2016.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
        # Only record videos with at least 100 views
        if int(row[1]) > 99:
            # A test list that holds whether the regex passes or fails
            test_list = []
            for ex in exclude_list:
                regex = re.compile(".*("+ex+").*")
                if regex.search(row[0]):
                    test_list.append(False)
                else:
                    test_list.append(True)
            # Depending on the results, see if the row is worthy of saving
            if all(result for result in test_list):
                data_set.append(row)

I know the code above is quite inefficient. I've seen examples of list comprehensions that can do a better job, but I don't quite understand how a list comprehension would work in this case. I also dislike having to compile the regex again for every word on every row; it feels like a waste of resources.

Accepted answer by 200_success · 2016-07-19 22:50:27Z

The CSV file contains text in some text encoding, and should not be opened in binary mode.
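As a rough sketch of that point, assuming Python 3 and a UTF-8 encoded file (the actual encoding of the original file is not stated), the file can be opened in text mode with an explicit encoding. The temporary file `videos.csv` below is a hypothetical stand-in for the real one, and `newline=""` follows the `csv` module's own recommendation.

```python
import csv
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for 'video_2013-2016.csv'
    path = os.path.join(tmp, "videos.csv")
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write('"Bloody Halls MV","258933"\n')

    # Text mode with an explicit encoding (UTF-8 is an assumption here);
    # newline="" lets the csv module handle line endings itself.
    with open(path, encoding="utf-8", newline="") as f:
        rows = list(csv.reader(f))
```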

You should construct one regular expression to find any of the forbidden words. It appears that you intended to do a case-insensitive search, but didn't write the code that way. When constructing the regex, you should escape the strings, in case they contain any regex metacharacters. You don't need .*, since re.search() will look for the pattern anywhere in the string, nor do you need capturing parentheses.
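To make the pieces concrete, here is a small sketch of how such a pattern could be built and what it matches, using the titles from the question. The string "a.b" is a made-up example to show what escaping does; it is not one of the words in the question.

```python
import re

exclude_list = ["mv", "live", "cover", "remix", "bootleg"]

# One alternation pattern for all words; re.I makes it case-insensitive.
forbidden = re.compile("|".join(re.escape(w) for w in exclude_list), re.I)

titles = [
    "There are happy days",
    "Beware of ignorance",
    "Bloody Halls MV",
    "Dream Theater - As I Am - Live in...",
]

# search() scans the whole string, so no leading or trailing .* is needed.
flagged = [bool(forbidden.search(t)) for t in titles]

# Escaping matters once a word contains a metacharacter (hypothetical word):
unescaped_hit = bool(re.search("a.b", "axb"))             # "." matches any character
escaped_hit = bool(re.search(re.escape("a.b"), "axb"))    # literal "a.b" only
```

Here `flagged` marks only the last two titles, because "MV" and "Live" match once case is ignored.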

If your comment says 100, then your code should have 100 rather than 99.

I suggest doing a destructuring assignment title, view_count = row to make it clear what each column represents.

with open('video_2013-2016.csv') as f:
    forbidden = re.compile('|'.join(re.escape(w) for w in exclude_list), re.I)
    for row in csv.reader(f):
        # Only record videos with at least 100 views and none of the bad words
        title, view_count = row
        if int(view_count) >= 100 and not forbidden.search(title):
            data_set.append(row)
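For anyone who wants to try this without the original file, the following is a self-contained sketch of the same approach. The helper name `filter_rows`, the `min_views` parameter, and the final "Quiet demo" row are assumptions added for illustration; the other rows are the samples from the question, read from an in-memory buffer instead of disk.

```python
import csv
import io
import re

exclude_list = ["mv", "live", "cover", "remix", "bootleg"]

# In-memory stand-in for the CSV file; the last row is a made-up
# low-view entry to exercise the view-count threshold.
sample = io.StringIO(
    '"There are happy days","1204923"\n'
    '"Beware of ignorance","589636"\n'
    '"Bloody Halls MV","258933"\n'
    '"Dream Theater - As I Am - Live in...","89526"\n'
    '"Quiet demo","42"\n'
)


def filter_rows(lines, excluded, min_views=100):
    """Keep rows with at least min_views views and no excluded word in the title."""
    # Compile once, outside the loop, instead of once per word per row.
    forbidden = re.compile("|".join(re.escape(w) for w in excluded), re.I)
    kept = []
    for title, view_count in csv.reader(lines):
        if int(view_count) >= min_views and not forbidden.search(title):
            kept.append([title, view_count])
    return kept


data_set = filter_rows(sample, exclude_list)
```

One caveat with this sketch: an empty `excluded` list produces an empty pattern, which matches every title, so a real version would probably want to handle that case separately.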

I can't believe I didn't think of regex OR method! Absolutely beautiful solution -> forbidden = re.compile('|'.join(re.escape(w) for w in exclude_list), re.I). I was very worried that I'm being inefficient for not using list comprehension, but now I see that I really didn't need any list comprehension.
From 1.8s to 0.1s, you're a beast! Thank you very much for your help!
