
This is my first non-trivial program in Python. I am coming from a Java background, so I may have messed up or ignored some Python conventions. I would like to hear feedback on my code.


```python
import nltk
import random

file_name = input()
file = open(file_name, "r", encoding="utf-8")
# nltk.trigrams returns a list of 3-tuples
trigrams = list(nltk.trigrams(file.read().split()))
file.close()

model = {}
for trigram in trigrams:
    head = trigram[0] + " " + trigram[1]
    tail = trigram[2]
    model.setdefault(head, {})
    model[head].setdefault(tail, 0)
    model[head][tail] += 1

possible_starting_heads = []
sentence_ending_punctuation = (".", "!", "?")
for key in model.keys():
    if key[0].isupper() and not key.split(" ")[0].endswith(sentence_ending_punctuation):
        possible_starting_heads.append(key)

# Generate 10 pseudo-sentences based on model
for _ in range(10):
    tokens = []
    # Chooses a random starting head from list
    head = random.choice(possible_starting_heads)
    # print("Head: ", head)
    tokens.append(head)
    while True:
        possible_tails = list(model[head].keys())
        weights = list(model[head].values())
        # Randomly select elements from list taking their weights into account
        most_probable_tail = random.choices(possible_tails, weights, k=1)[0]
        # print("Most probable tail: ", most_probable_tail)
        if most_probable_tail.endswith(sentence_ending_punctuation) and len(tokens) >= 5:
            tokens.append(most_probable_tail)
            # print("Chosen tail and ending sentence: ", most_probable_tail)
            break
        elif not most_probable_tail.endswith(sentence_ending_punctuation):
            tokens.append(most_probable_tail)
            # print("Chosen tail: ", most_probable_tail)
            head = head.split(" ")[1] + " " + most_probable_tail
        elif most_probable_tail.endswith(sentence_ending_punctuation) and len(tokens) < 5:
            # print("Ignoring tail: ", most_probable_tail)
            tokens = []
            head = random.choice(possible_starting_heads)
            tokens.append(head)
    pseudo_sentence = " ".join(tokens)
    print(pseudo_sentence)
```
asked Aug 30, 2021 at 9:32

2 Answers


Here's a kinda stream-of-consciousness review:

reading the file

input() takes an argument, typically a prompt or question, so the user knows to enter something. You could also check sys.argv for a filename on the command line.
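For example, a minimal sketch combining both suggestions (the prompt text is illustrative):

```python
import sys

# Use a filename from the command line if one was given,
# otherwise fall back to prompting the user.
if len(sys.argv) > 1:
    file_name = sys.argv[1]
else:
    file_name = input("Enter the name of the input file: ")
```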

Avoid using the names of built-in functions (e.g., file) as variable names.

open() is a context manager, so it can be used in a with statement. In the current code, an exception anywhere in list(nltk.trigrams(file.read().split())) could result in the file being left open. Using a with statement ensures the file is closed:

with open(file_name, "r", encoding="utf-8") as input_file:
    # nltk.trigrams returns a list of 3-tuples
    trigrams = list(nltk.trigrams(input_file.read().split()))

model data structure

Get to know the collections module in the standard library. Here, defaultdict and Counter would be useful.

A tuple can be a dictionary key, so concatenating the first two items in a trigram is not necessary.

import collections
model = collections.defaultdict(collections.Counter)
for trigram in trigrams:
    model[trigram[:2]].update(trigram[2:])

Or, to be a little clearer:

for word1, word2, word3 in trigrams:
    model[(word1, word2)].update((word3,))

Then, first words can be collected at the same time:

possible_starting_heads = collections.Counter()
for word1, word2, word3 in trigrams:
    model[(word1, word2)].update((word3,))
    if word1[0].isupper() and not word1.endswith(sentence_ending_punctuation):
        # Wrap the pair in a list so the head is counted as one element
        possible_starting_heads.update([(word1, word2)])

generating sentences

If you intend to make lots of sentences, recreating lists of candidate tails and their weights every time may slow things down. Consider restructuring the model to better suit the end use. This can be done all at once:

from itertools import accumulate

new_model = {}
for key, counter in model.items():
    new_model[key] = (list(counter.keys()), list(accumulate(counter.values())))

most_probable_tail is a misnomer; it's the chosen tail.

The logic of the if ... elif ... elif statement is not easy to follow, and .endswith(sentence_ending_punctuation) potentially gets called three times.

tokens = []
# Choose a random starting head; random.choice() needs a sequence, not a Counter
head = random.choice(list(possible_starting_heads))
# print("Head: ", head)
tokens.extend(head)
while True:
    possible_tails, cum_weights = new_model[head]
    # random.choices() returns a list, so take its single element
    chosen_tail = random.choices(possible_tails, cum_weights=cum_weights, k=1)[0]
    if not chosen_tail.endswith(sentence_ending_punctuation):
        tokens.append(chosen_tail)
        head = (head[1], chosen_tail)
    else:
        if len(tokens) >= 5:
            tokens.append(chosen_tail)
            break
        else:
            tokens = []
            head = random.choice(list(possible_starting_heads))
            tokens.extend(head)
pseudo_sentence = " ".join(tokens)

other thoughts

You can store the first word of a sentence in the model with a key of ('', '') and the second word with a key of ('', word1). Then they don't need to be handled separately when generating the sentences.
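For illustration, a minimal sketch of that padding idea, using a toy stand-in for real input (real code would first split the text into sentences):

```python
import collections
import nltk

# Toy stand-in for real input: a list of token lists, one per sentence.
sentences = [["The", "cat", "sat."], ["The", "dog", "ran."]]

model = collections.defaultdict(collections.Counter)
for sentence in sentences:
    # Two empty-string pads mean the first word is keyed by ('', '')
    # and the second word by ('', word1).
    for word1, word2, word3 in nltk.trigrams(["", ""] + sentence):
        model[(word1, word2)][word3] += 1

print(model[("", "")])  # Counter({'The': 2})
```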

NLTK has a function word_tokenize() that might be helpful. It breaks text into tokens such as words, numbers, and punctuation, and it recognizes abbreviations like Dr.
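For example (the output shown is what I'd expect; the tokenizer data may need a one-time nltk.download("punkt") first):

```python
import nltk

# word_tokenize splits punctuation into its own tokens but keeps
# abbreviations like "Dr." intact.
print(nltk.word_tokenize("Dr. Smith arrived. He was late!"))
# expected: ['Dr.', 'Smith', 'arrived', '.', 'He', 'was', 'late', '!']
```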

answered Aug 30, 2021 at 20:51
  • I double-checked after reading your answer, and file was removed from the built-in functions somewhere between Python 2.7 and 3.5, neither of which is still supported. There's no harm in avoiding it as a variable name, but on any somewhat recent version of Python shadowing it shouldn't wreak havoc. Commented Aug 31, 2021 at 21:00

In the following, I assume that you use Python 3 rather than Python 2, though I can't say for sure that it makes a difference.

First, when you parse your file, you could use a context manager:

file_name = input()
with open(file_name, "r", encoding="utf-8") as file:
    # nltk.trigrams returns a list of 3-tuples
    trigrams = list(nltk.trigrams(file.read().split()))

That way, even if an exception is raised by the assignment to trigrams (in any of the four method calls), the file stream is properly closed.

You could simplify your model generation using defaultdict from the collections module. Your way of doing it isn't actually wrong, and I can't say for sure that this option is more Pythonic, but it might be interesting to know.

import collections
model = collections.defaultdict(lambda: collections.defaultdict(int))
for trigram in trigrams:
    head = trigram[0] + " " + trigram[1]
    tail = trigram[2]
    model[head][tail] += 1

This does not change the behavior of your algorithm, it just feels a bit simpler to me.

But you can do something more memory-efficient:

import collections
model = collections.defaultdict(lambda: collections.defaultdict(int))
file_name = input()
with open(file_name, "r", encoding="utf-8") as file:
    # nltk.trigrams returns a lazy iterator of 3-tuples
    trigrams = nltk.trigrams(file.read().split())
    for trigram in trigrams:
        head = trigram[0] + " " + trigram[1]
        tail = trigram[2]
        model[head][tail] += 1

Since nltk.trigrams returns an iterator, and since you only iterate over it once, you don't actually need to store it in a list (an operation that takes time and memory to copy everything from the iterator into a list) before iterating over it. You could even completely remove the trigrams variable and directly write for ... in nltk.trigrams(...).
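A sketch of what that final form could look like, reusing the defaultdict model and file_name from the block above:

```python
with open(file_name, "r", encoding="utf-8") as file:
    # Iterate over the generator directly; no intermediate list is built.
    for trigram in nltk.trigrams(file.read().split()):
        head = trigram[0] + " " + trigram[1]
        tail = trigram[2]
        model[head][tail] += 1
```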

Finally, in your while loop, most_probable_tail is misnamed: it is not the most probable tail, it is the one your algorithm randomly selected from a possibly non-uniform distribution. I would rather call it candidate_tail, or selected_tail, or, even better, simply tail.

answered Aug 30, 2021 at 19:28
