\$\begingroup\$

Consider a .csv file that contains video names and their view counts, like so:

"There are happy days","1204923"
"Beware of ignorance","589636"
"Bloody Halls MV","258933"
"Dream Theater - As I Am - Live in...","89526"

The intent of my code is to filter the rows of the CSV against a list of excluded words: if the name of a video contains any word in the list, the row is not saved. The code is as follows:

import csv
import re

exclude_list = ["mv","live","cover","remix","bootleg"]
data_set = []
with open('video_2013-2016.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
        # Only record videos with at least 100 views
        if int(row[1]) > 99:
            # A test list that holds whether the regex passes or fails
            test_list = []
            for ex in exclude_list:
                regex = re.compile(".*("+ex+").*")
                if regex.search(row[0]):
                    test_list.append(False)
                else:
                    test_list.append(True)
            # Depending on the results, see if the row is worthy of saving
            if all(result for result in test_list):
                data_set.append(row)

I know the code above is quite inefficient. I've seen examples of list comprehensions that do a better job, but I don't quite understand how a list comprehension would work in this case. I particularly dislike compiling the regex over and over; it feels like a waste of resources.

200_success
asked Jul 19, 2016 at 22:24
\$\endgroup\$

1 Answer

\$\begingroup\$

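The CSV file contains text in some text encoding, and in Python 3 it should be opened in text mode rather than binary mode. (The Python 2 csv module did expect 'rb'; the Python 3 csv documentation recommends opening the file with newline=''.)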
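As a minimal sketch of reading CSV data in text mode, assuming Python 3: an in-memory io.StringIO stands in for the real file so the snippet runs on its own, and the two rows are copied from the question's sample. With the actual file, the equivalent call would be something like open('video_2013-2016.csv', newline='', encoding='utf-8'), where the UTF-8 encoding is an assumption, since the question does not say how the file is encoded.

```python
import csv
import io

# Stand-in for the real file: two rows copied from the question's sample.
# newline='' mirrors what the csv docs recommend passing to open().
sample = io.StringIO(
    '"There are happy days","1204923"\n'
    '"Bloody Halls MV","258933"\n',
    newline='',
)

rows = list(csv.reader(sample))

# Every field comes back as str, so numeric columns need an explicit int().
for title, view_count in rows:
    print(title, int(view_count))
```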

You should construct one regular expression to find any of the forbidden words. It appears that you intended to do a case-insensitive search, but didn't write the code that way. When constructing the regex, you should escape the strings, in case they contain any regex metacharacters. You don't need .*, since re.search() will look for the pattern anywhere in the string, nor do you need capturing parentheses.
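The points in this paragraph can be illustrated with a small sketch. The "c++" entry and the third title are hypothetical, added only to show why escaping matters: unescaped, the + characters would be read as regex quantifiers rather than literal plus signs.

```python
import re

# "c++" is a hypothetical extra entry, not part of the question's list.
exclude_list = ["mv", "live", "c++"]

# One alternation pattern (mv|live|c\+\+), matched case-insensitively.
forbidden = re.compile("|".join(re.escape(w) for w in exclude_list), re.IGNORECASE)

titles = ["Bloody Halls MV", "There are happy days", "Learning C++ basics"]

# search() scans the whole string, so no leading or trailing .* is needed.
kept = [t for t in titles if not forbidden.search(t)]
print(kept)  # ['There are happy days']
```

The pattern is compiled once and reused for every title, which is where the saving over compiling inside the loop comes from.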

If your comment says 100, then your code should have 100 rather than 99.

I suggest doing a destructuring assignment (title, view_count = row) to make it clear what each column represents.

import csv
import re

# exclude_list and data_set are defined as in the question
with open('video_2013-2016.csv') as f:
    forbidden = re.compile('|'.join(re.escape(w) for w in exclude_list), re.I)
    for row in csv.reader(f):
        # Only record videos with at least 100 views and none of the bad words
        title, view_count = row
        if int(view_count) >= 100 and not forbidden.search(title):
            data_set.append(row)
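Since the question asks how a list comprehension could fit here, the following is one possible sketch that avoids regular expressions altogether. It is an alternative, not the approach recommended above: it lowercases each title and uses any() with plain substring tests. The helper name is_wanted and its min_views parameter are made up for illustration, and the rows are hard-coded from the question's sample rather than read from the file.

```python
exclude_list = ["mv", "live", "cover", "remix", "bootleg"]

# Rows copied from the question's sample, as csv.reader would yield them.
rows = [
    ["There are happy days", "1204923"],
    ["Beware of ignorance", "589636"],
    ["Bloody Halls MV", "258933"],
    ["Dream Theater - As I Am - Live in...", "89526"],
]


def is_wanted(title, view_count, min_views=100):
    """Return True if the row has enough views and no excluded word."""
    lowered = title.lower()
    has_bad_word = any(word in lowered for word in exclude_list)
    return int(view_count) >= min_views and not has_bad_word


# The comprehension replaces the explicit loop plus data_set.append(row).
data_set = [row for row in rows if is_wanted(*row)]
print(data_set)
```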
answered Jul 19, 2016 at 22:50
\$\endgroup\$
  • I can't believe I didn't think of regex OR method! Absolutely beautiful solution -> forbidden = re.compile('|'.join(re.escape(w) for w in exclude_list), re.I). I was very worried that I'm being inefficient for not using list comprehension, but now I see that I really didn't need any list comprehension. Commented Jul 19, 2016 at 22:55
  • From 1.8s to 0.1s, you're a beast! Thank you very much for your help! Commented Jul 19, 2016 at 22:58
