I have the following code snippet which I created with the help of this tutorial for unsupervised sentiment analysis purposes:

sent = [row for row in file_model.message]
phrases = Phrases(sent, min_count=1, progress_per=50000)
bigram = Phraser(phrases)
sentences = bigram[sent]
sentences[1]
file_export = file_model.copy()
file_export['old_message'] = file_export.message
file_export.old_message = file_export.old_message.str.join(' ')
file_export.message = file_export.message.apply(lambda x: ' '.join(bigram[x]))
file_export.to_csv('cleaned_dataset.csv', index=False)

Since I now want trigrams as well as bigrams, I tried adjusting it to:

sent = [row for row in file_model.message]
phrases = Phrases(sent, min_count=1, progress_per=50000)
bigram = Phraser(phrases)
trigram = Phraser(bigram[phrases])
sentences = trigram[sent]
sentences[1]
file_export = file_model.copy()
file_export['old_message'] = file_export.message
file_export.old_message = file_export.old_message.str.join(' ')
file_export.message = file_export.message.apply(lambda x: ' '.join(trigram[x]))
file_export.to_csv('cleaned_dataset.csv', index=False)

But when I run this, I get TypeError: 'int' object is not iterable, which I assume refers to my adjustment trigram = Phraser(bigram[phrases]). I am using gensim 4.1.2. Unfortunately, I have no computer science background, and the solutions I find online don't help.

asked Apr 18, 2022 at 16:22

1 Answer

As a general matter, it's best if you include in your question (by later editing if necessary) the entire multiline error message you received, including any 'traceback' showing involved filenames, line-numbers, & lines-of-source-code. That helps potential answerers focus on exactly where things are going wrong.

Also, beware that many of the tutorials at 'towardsdatascience.com' are of very poor quality. I can't see the exact one you've linked without registering (which I'd rather not do), but from your code excerpts, I already see a few issues of varying severity for what you're trying to do:

  • (fatal) If you want to apply the Phrases algorithm more than once, to build up phrases longer than bigrams, you can't reuse the model trained for bigrams. You need to train a new model for each new level-of-combination, on the output of the prior model. That is, the input to the trigram Phrases model (which must itself be trained) must be the results of applying the bigram model, so it sees a mixture of the original unigrams & now-combined bigrams.
  • (unwise) Generally, using a low min_count=1 on these sorts of data-hungry models can easily backfire. They need many examples for their statistical methods to do anything sensible; discarding the rarest words usually helps to speed processing, shrink the models, & work mainly on tokens where there are enough examples to do something possibly sensible. (With very few, or only 1, usage examples, results may seem somewhat random/arbitrary.)
  • (a bit outdated but not a big problem) In Gensim 4+, the Phraser utility class, which just exists to optimize the Phrases model a bit when you're sure you're done training/tuning, has been renamed FrozenPhrases. (The old name still works, but this is an indication the tutorial hasn't been recently refreshed.)

And in general, beware: without a lot of data, the output of any number of Phrases applications may not be strong. And in all cases, it may not 'look right' to human sensibilities, because it's purely statistical and co-occurrence driven. (Though, even if its output looks weird, it will sometimes help on certain info-retrieval/classification tasks, as it manages to create useful new features that are different from the unigrams alone.)

My suggestions would be:

  • only add any Phrases combinations after things are working without, so you can compare results & see if it's helping.
  • start with bigrams only, and be sure via careful review or rigorous scoring that it's working/helping
  • if you need another level of combination, add that later, & ensure the trigram Phrases is initialized with the already-bigram-combined texts.

(Unfortunately, I can't find an example of two-level Phrases use in the current Gensim docs; I think some old examples were edited out in doc-simplification work. But there are a couple of examples of it being used, not all-wrong, in the project's testing source code: search the file https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_phrases.py for trigram. But remember those aren't best practices, either, as they are focused minimal tests.)

answered Apr 18, 2022 at 23:05

2 Comments

Hey, thank you so much for your answer! It really helps me. Just one more question: let's say I want to try out the model with unigrams only, how can I do that? By simply using code like:

sent = [row for row in file_model.message]
phrases = Phrases(sent, min_count=1, progress_per=50000)
sentences = phrases[sent]
sentences[1]

instead of the original one?

sent = [row for row in file_model.message]
phrases = Phrases(sent, min_count=1, progress_per=50000)
bigram = Phraser(phrases)
sentences = bigram[sent]
sentences[1]

If you simply want to apply Phrases once, to the original unigrams, then get a transformed corpus where some of the statistically-interesting word-pairs are combined into word1_word2 bigrams, your code looks about right. But (1) the above comment re min_count still applies; (2) the real test is whether the output sequence includes text changed the way you expect: when you try it, does it look right?; (3) the Gensim docs include model code for that simple application: radimrehurek.com/gensim/models/phrases.html
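
One way to make the "does it look right?" check in point (2) concrete is to list which merged tokens appear in the transformed sentences. This is only a sketch: merged_tokens is a hypothetical helper, not part of Gensim, the sample data is made up, and it assumes the default "_" delimiter and that none of the original tokens already contain an underscore.

```python
from collections import Counter

def merged_tokens(transformed_sentences, delimiter="_"):
    """Count tokens containing the delimiter, i.e. likely merged phrases."""
    counts = Counter()
    for sentence in transformed_sentences:
        for token in sentence:
            if delimiter in token:
                counts[token] += 1
    return counts

# Hypothetical output of applying a Phrases model to a tokenized corpus.
sentences = [
    ["i", "love", "new_york"],
    ["new_york", "is", "big"],
    ["machine_learning", "is", "fun"],
]

# Print merged phrases, most frequent first, for manual review.
for phrase, n in merged_tokens(sentences).most_common():
    print(phrase, n)
```

If the list is empty, or full of pairs that make no sense, that suggests the min_count/threshold settings or the amount of data need another look.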
