
This is my first non-trivial program in Python. I am coming from a Java background, so I may have messed up or ignored some Python conventions. I would like to hear feedback on my code.


```python
import nltk
import random

file_name = input()
file = open(file_name, "r", encoding="utf-8")
# nltk.trigrams returns a list of 3-tuples
trigrams = list(nltk.trigrams(file.read().split()))
file.close()

model = {}
for trigram in trigrams:
    head = trigram[0] + " " + trigram[1]
    tail = trigram[2]
    model.setdefault(head, {})
    model[head].setdefault(tail, 0)
    model[head][tail] += 1

possible_starting_heads = []
sentence_ending_punctuation = (".", "!", "?")
for key in model.keys():
    if key[0].isupper() and not key.split(" ")[0].endswith(sentence_ending_punctuation):
        possible_starting_heads.append(key)

# Generate 10 pseudo-sentences based on model
for _ in range(10):
    tokens = []
    # Chooses a random starting head from list
    head = random.choice(possible_starting_heads)
    # print("Head: ", head)
    tokens.append(head)
    while True:
        possible_tails = list(model[head].keys())
        weights = list(model[head].values())
        # Randomly select elements from list taking their weights into account
        most_probable_tail = random.choices(possible_tails, weights, k=1)[0]
        # print("Most probable tail: ", most_probable_tail)
        if most_probable_tail.endswith(sentence_ending_punctuation) and len(tokens) >= 5:
            tokens.append(most_probable_tail)
            # print("Chosen tail and ending sentence: ", most_probable_tail)
            break
        elif not most_probable_tail.endswith(sentence_ending_punctuation):
            tokens.append(most_probable_tail)
            # print("Chosen tail: ", most_probable_tail)
            head = head.split(" ")[1] + " " + most_probable_tail
        elif most_probable_tail.endswith(sentence_ending_punctuation) and len(tokens) < 5:
            # print("Ignoring tail: ", most_probable_tail)
            tokens = []
            head = random.choice(possible_starting_heads)
            tokens.append(head)
    pseudo_sentence = " ".join(tokens)
    print(pseudo_sentence)
```
asked Aug 30, 2021 at 9:32

2 Answers


Here's a kinda stream-of-consciousness review:

reading the file

input() takes an argument, typically a prompt or question, so the user knows to enter something. You could also check sys.argv for a filename on the command line.
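For example, a minimal sketch combining both suggestions (the prompt text is illustrative):

```python
import sys

# Use a filename from the command line if one was given,
# otherwise fall back to prompting the user.
if len(sys.argv) > 1:
    file_name = sys.argv[1]
else:
    file_name = input("Enter the name of the input file: ")
```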

Avoid using the names of built-in functions (e.g., file) as variable names.

open() is a context manager, so it can be used in a with statement. In the current code, an exception anywhere in list(nltk.trigrams(file.read().split())) could result in the file being left open. Using a with statement ensures the file is closed:

with open(file_name, "r", encoding="utf-8") as input_file:
    # nltk.trigrams returns a list of 3-tuples
    trigrams = list(nltk.trigrams(input_file.read().split()))

model data structure

Get to know the collections module in the standard library. Here, defaultdict and Counter would be useful.

A tuple can be a dictionary key, so concatenating the first two items in a trigram is not necessary.

import collections
model = collections.defaultdict(collections.Counter)
for trigram in trigrams:
    model[trigram[:2]].update(trigram[2:])

Or, to be a little clearer:

for word1, word2, word3 in trigrams:
    model[(word1, word2)].update((word3,))

Then, first words can be collected at the same time:

possible_starting_heads = collections.Counter()
for word1, word2, word3 in trigrams:
    model[(word1, word2)].update((word3,))
    if word1[0].isupper() and not word1.endswith(sentence_ending_punctuation):
        # Wrap the pair in a list so the head is counted as one element
        possible_starting_heads.update([(word1, word2)])

generating sentences

If you intend to make lots of sentences, recreating lists of candidate tails and their weights every time may slow things down. Consider restructuring the model to better suit the end use. This can be done all at once:

from itertools import accumulate

new_model = {}
for key, counter in model.items():
    new_model[key] = (list(counter.keys()), list(accumulate(counter.values())))

most_probable_tail is a misnomer; it's the chosen tail.

The logic of the if ... elif ... elif statement is not easy to follow, and .endswith(sentence_ending_punctuation) potentially gets called three times.

tokens = []
# Choose a random starting head; random.choice() needs a sequence, not a Counter
head = random.choice(list(possible_starting_heads))
# print("Head: ", head)
tokens.extend(head)
while True:
    possible_tails, cum_weights = new_model[head]
    # random.choices() returns a list, so take its single element
    chosen_tail = random.choices(possible_tails, cum_weights=cum_weights, k=1)[0]
    if not chosen_tail.endswith(sentence_ending_punctuation):
        tokens.append(chosen_tail)
        head = (head[1], chosen_tail)
    else:
        if len(tokens) >= 5:
            tokens.append(chosen_tail)
            break
        else:
            tokens = []
            head = random.choice(list(possible_starting_heads))
            tokens.extend(head)
pseudo_sentence = " ".join(tokens)

other thoughts

You can store the first word of a sentence in the model with a key of ('', '') and the second word with a key of ('', word1). Then they don't need to be handled separately when generating the sentences.
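For illustration, a minimal sketch of that padding idea, using a toy stand-in for real input (real code would first split the text into sentences):

```python
import collections
import nltk

# Toy stand-in for real input: a list of token lists, one per sentence.
sentences = [["The", "cat", "sat."], ["The", "dog", "ran."]]

model = collections.defaultdict(collections.Counter)
for sentence in sentences:
    # Two empty-string pads mean the first word is keyed by ('', '')
    # and the second word by ('', word1).
    for word1, word2, word3 in nltk.trigrams(["", ""] + sentence):
        model[(word1, word2)][word3] += 1

print(model[("", "")])  # Counter({'The': 2})
```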

NLTK has a function word_tokenize() that might be helpful. It breaks text into tokens such as words, numbers, and punctuation, and it recognizes abbreviations like Dr.
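For example (the output shown is what I'd expect; the tokenizer data may need a one-time nltk.download("punkt") first):

```python
import nltk

# word_tokenize splits punctuation into its own tokens but keeps
# abbreviations like "Dr." intact.
print(nltk.word_tokenize("Dr. Smith arrived. He was late!"))
# expected: ['Dr.', 'Smith', 'arrived', '.', 'He', 'was', 'late', '!']
```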

answered Aug 30, 2021 at 20:51
  • I double-checked after reading your answer, and file was removed from the built-in functions somewhere between Python 2.7 and 3.5, neither of which is still supported. There's no harm in avoiding it as a variable name, but on any somewhat recent version of Python shadowing it shouldn't wreak havoc. Commented Aug 31, 2021 at 21:00

In the following, I assume that you use Python 3 rather than Python 2, though I can't say for sure that it makes a difference.

First, when you parse your file, you could use a context manager:

file_name = input()
with open(file_name, "r", encoding="utf-8") as file:
    # nltk.trigrams returns a list of 3-tuples
    trigrams = list(nltk.trigrams(file.read().split()))

That way, even if an exception is raised by the assignment to trigrams (in any of the four method calls), the file stream is properly closed.

You could simplify your model generation using defaultdict from the collections module. Your way of doing it isn't actually wrong, and I can't say for sure that this option is more Pythonic, but it might be interesting to know.

import collections
model = collections.defaultdict(lambda: collections.defaultdict(int))
for trigram in trigrams:
    head = trigram[0] + " " + trigram[1]
    tail = trigram[2]
    model[head][tail] += 1

This does not change the behavior of your algorithm, it just feels a bit simpler to me.

But you can do something more memory-efficient:

import collections
model = collections.defaultdict(lambda: collections.defaultdict(int))
file_name = input()
with open(file_name, "r", encoding="utf-8") as file:
    # nltk.trigrams returns a lazy iterator of 3-tuples
    trigrams = nltk.trigrams(file.read().split())
    for trigram in trigrams:
        head = trigram[0] + " " + trigram[1]
        tail = trigram[2]
        model[head][tail] += 1

Since nltk.trigrams returns an iterator, and since you only iterate over it once, you don't actually need to store it in a list (an operation that takes time and memory to copy everything from the iterator into a list) before iterating over it. You could even completely remove the trigrams variable and directly write for ... in nltk.trigrams(...).
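A sketch of what that final form could look like, reusing the defaultdict model and file_name from the block above:

```python
with open(file_name, "r", encoding="utf-8") as file:
    # Iterate over the generator directly; no intermediate list is built.
    for trigram in nltk.trigrams(file.read().split()):
        head = trigram[0] + " " + trigram[1]
        tail = trigram[2]
        model[head][tail] += 1
```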

Finally, in your while loop, most_probable_tail is misnamed: it is not the most probable tail, it is the one your algorithm randomly selected from a possibly non-uniform distribution. I would rather call it candidate_tail, or selected_tail, or, even better, simply tail.

answered Aug 30, 2021 at 19:28
