I have a dataset `X_train`, which is an array where each entry is an email (a string of characters). There are 11,314 emails, each of which is about 500 characters long. (`X_train` is a processed version of the training data in the newsgroups dataset.)
Ultimately, my goal is to build a tf-idf function from scratch (knowledge of which is probably not necessary for answering my question). To get there, I have constructed a lexicon that contains each unique word in `X_train` once and only once. My lexicon has 211,441 elements. I also need an array where each entry `frequency_train[i]` is the number of emails in which the term `lexicon_train[i]` appears.
I construct the frequency array as follows:
```python
frequency_train = np.zeros(211441)
for i in range(211441):
    count = 0
    for email in X_train:
        if lexicon_train[i] in email:
            count = count + 1
    frequency_train[i] = count
```
In the same cell, I am also doing something similar with the testing data `X_test`. I've been running this in a Jupyter notebook, and the process takes a while. A previous, very similar task took about 90 minutes. I suspect that I'm doing this task in the slowest possible way. Is there a faster way of doing this? I would also welcome answers that explain why this process should take a long time.
2 Answers
For each word in the lexicon you are searching through every email: (11,314 emails) × (60 words/email) × (211,441-word lexicon) = a lot of comparisons.

Flip it around and use `collections.Counter`. Get the unique words in each email (with a `set()`) and then update the counter:
```python
from collections import Counter

counts = Counter()
for email in X_train:
    words = set(email.split())  # <= or whatever you use to parse the words
    counts.update(words)
```
This will give you a dict mapping each word to the number of emails it appears in: (11,314 emails) × (60 words/email) = far fewer loops. It also effectively recreates the lexicon; `counts.keys()` should be the lexicon.
On my computer, it takes 7 seconds to generate 115,000 random 60-word emails and collect the counts.
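If you still need a NumPy array aligned with a lexicon, you can derive both from the counter. A minimal sketch with a toy three-email `X_train` (the variable names mirror the question; the data is made up):

```python
import numpy as np
from collections import Counter

X_train = ["spam offer now", "hello spam", "meeting at noon"]  # toy stand-in data

# Count, per word, the number of emails it appears in.
counts = Counter()
for email in X_train:
    counts.update(set(email.split()))  # set() so each email counts a word at most once

# The counter's keys double as the lexicon; sort for a stable order.
lexicon_train = sorted(counts)
frequency_train = np.array([counts[word] for word in lexicon_train], dtype=float)
# e.g. "spam" appears in 2 of the 3 emails.
```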
- Yes, I would go with this answer. Improves performance and reduces the code required. Nice job. – Ben A, Mar 26, 2020 at 6:19
Your `for` loop can be reduced to one line, utilizing `sum`:
```python
frequency_train = [
    sum(1 if lexicon_train[i] in email else 0 for email in X_train)
    for i in range(211441)
]
```
It removes the need to create the initial list of zeros. For performance, I'm guessing the size of the lexicon and the number of iterations are slowing it down.
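As a further small tightening (not from the answer, just a common idiom): since Python booleans are integers, the `1 if ... else 0` can be dropped and `sum` applied to the boolean test directly. A sketch on made-up data:

```python
X_train = ["mark went home", "denmark is cold", "hello there"]  # toy emails
lexicon_train = ["mark", "hello"]

frequency_train = [
    sum(term in email for email in X_train)  # True counts as 1, False as 0
    for term in lexicon_train
]
# Note "mark" also matches inside "denmark", because `in` is a substring test.
```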
- Thank you! This does indeed simplify my code. However, about 15 minutes later, the cell is still running. There may very well be no way around this: just running through the for loops requires about 2.2 billion steps, not to mention the other computations that the entire cell requires. I'm pretty new here, so I'll defer to the community as to whether or not I should accept this as an answer. – co-contravariant, Mar 26, 2020 at 0:34
- @co-contravariant This answer is really just about reducing the lines in your program and utilizing a built-in function. If an answer comes along that improves your performance, definitely go with that one. – Ben A, Mar 26, 2020 at 0:47
- For anyone in the audience who's curious: the process has finally terminated. It took about an hour. Now on to the testing data... – co-contravariant, Mar 26, 2020 at 1:34
- Cf. Histogram word counter in Python – greybeard, Mar 26, 2020 at 6:10
- … `X_train` and `lexicon_train`? Do you only need the total `count`, what are the bounds? It's almost like you're trying to impede us from helping you.
- … `"mark"` is in `lexicon_train`, then the `in email` will count `"Denmark"` and `"marker"`, but not `"Mark"`. Should only complete words be matched? What glyphs can exist in the words? Hyphens or apostrophes?
- … `frequency_train` will contain incorrect counts. If an email contains `"i’m going to denmark"`, and `lexicon_train` contains `"mark"` and `"denmark"`, the email will be counted as containing both those words, because `"mark" in "i’m going to denmark"` is `True`. It would also be counted as including the words `den`, `go`, `in`, and `ark` if those words also appear in `lexicon_train`, because `str in str` checks if the needle appears anywhere in the haystack, without regard for word boundaries.