I read about how Markov chains are handy for building text generators and wanted to give it a try in Python.
I'm not sure if this is the proper way to make a Markov chain. I've left comments in the code. Any feedback would be appreciated.
import random

def Markov(text_file):
    with open(text_file) as f: # provide a text-file to parse
        data = f.read()
    data = [i for i in data.split(' ') if i != ''] # create a list of all words
    data = [i.lower() for i in data if i.isalpha()] # i've been removing punctuation
    markov = {i:[] for i in data} # i create a dict with the words as keys and empty lists as values
    pos = 0
    while pos < len(data) - 1: # add a word to the word-key's list if it immediately follows that word
        markov[data[pos]].append(data[pos+1])
        pos += 1
    new = {k:v for k,v in zip(range(len(markov)), [i for i in markov])} # create another dict for the seed to match up with
    length_sentence = random.randint(15, 20) # create a random length for a sentence stopping point
    seed = random.randint(0, len(new) - 1) # randomly pick a starting point
    sentence_data = [new[seed]] # use that word as the first word and starting point
    current_word = new[seed]
    while len(sentence_data) < length_sentence:
        next_index = random.randint(0, len(markov[current_word]) - 1) # randomly pick a word from the last words list.
        next_word = markov[current_word][next_index]
        sentence_data.append(next_word)
        current_word = next_word
    return ' '.join([i for i in sentence_data])
1 Answer
import random
def Markov(text_file):
Python convention is to name functions lowercase_with_underscores. I'd also probably have this function take a string as input rather than a filename. That way the function doesn't make assumptions about where the data is coming from.
with open(text_file) as f: # provide a text-file to parse
    data = f.read()
data is a bit too generic. I'd call it text.
data = [i for i in data.split(' ') if i != ''] # create a list of all words
data = [i.lower() for i in data if i.isalpha()] # i've been removing punctuation
Since ''.isalpha() == False, the isalpha() filter already drops empty strings, so you could easily combine these two lines.
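A minimal sketch of that combination, assuming a made-up sample string (the names `two_step` and `words` are only for illustration). It checks that a single comprehension gives the same result as the two-step version:

```python
# Hypothetical sample text; the double space yields an empty string on split(' ').
text = "The cat  sat on the mat. The cat ran"

# Original two-step version: drop empty strings, then keep alphabetic words.
two_step = [i for i in text.split(' ') if i != '']
two_step = [i.lower() for i in two_step if i.isalpha()]

# Combined version: ''.isalpha() is False, so the explicit empty-string check is redundant.
words = [i.lower() for i in text.split(' ') if i.isalpha()]

print(words == two_step)
print(words)
```

Note that isalpha() discards a token such as "mat." entirely rather than stripping the period, so "removing punctuation" here really means removing every word that touches punctuation.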
markov = {i:[] for i in data} # i create a dict with the words as keys and empty lists as values
pos = 0
while pos < len(data) - 1: # add a word to the word-key's list if it immediately follows that word
    markov[data[pos]].append(data[pos+1])
    pos += 1
Whenever possible, avoid iterating over indexes. In this case I'd use
for before, after in zip(data, data[1:]):
    markov[before].append(after)
I think that's much clearer.
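As a runnable sketch of the pairwise loop (the word list is a made-up stand-in for the parsed text), using append so that each follower is stored as a whole word:

```python
# Hypothetical word list standing in for the parsed text.
data = ["the", "cat", "sat", "on", "the", "mat"]

markov = {word: [] for word in data}

# zip pairs each word with its successor; the final word has no successor.
for before, after in zip(data, data[1:]):
    # append stores the word itself; += with a string would add its letters one by one.
    markov[before].append(after)

print(markov)
```

The last word ("mat" here) ends up with an empty follower list unless it also occurs earlier in the text, which matters later when picking a next word.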
new = {k:v for k,v in zip(range(len(markov)), [i for i in markov])} # create another dict for the seed to match up with
[i for i in markov] can be written as list(markov), and it produces a new list of markov's keys. But there is no reason to make that copy here, so just pass markov directly.
zip(range(len(x)), x) can be written as enumerate(x).
{k:v for k,v in x} is the same as dict(x).
So that whole line can be written as
new = dict(enumerate(markov))
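But that's a strange construct to build. Since you are indexing with numbers, it'd make more sense to have a list. An equivalent list would be
new = list(markov)
which gives you a list of the keys. (On Python 2, markov.keys() returns the same list; on Python 3 it returns a view that cannot be indexed, so list(markov) is the portable spelling.)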
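To make the chain of simplifications concrete, here is a small sketch with a made-up dictionary. It compares the original comprehension, the dict(enumerate(...)) form, and a plain list of keys:

```python
# Hypothetical chain; only the keys matter for this comparison.
markov = {"the": ["cat"], "cat": ["sat"], "sat": []}

# The original construct: a dict mapping 0, 1, 2, ... to the keys of markov.
original = {k: v for k, v in zip(range(len(markov)), [i for i in markov])}

# The same mapping, written with enumerate and dict.
shorter = dict(enumerate(markov))

# A plain list of keys supports the same integer indexing.
as_list = list(markov)

print(original == shorter)
print(all(as_list[i] == shorter[i] for i in range(len(as_list))))
```

All three give the same word for a given index, so the list is the simplest choice.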
length_sentence = random.randint(15, 20) # create a random length for a sentence stopping point
seed = random.randint(0, len(new) - 1) # randomly pick a starting point
Python has a function random.randrange such that random.randrange(x) is equivalent to random.randint(0, x - 1). It's good to use that when selecting from a range of indexes like this.
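A short illustration of that point, with a made-up list of keys (the names `words` and `index` are only for this sketch):

```python
import random

words = ["the", "cat", "sat"]  # hypothetical list of keys

# randrange(n) draws from 0..n-1, the same range as randint(0, n - 1),
# so the result is always a valid index into words.
index = random.randrange(len(words))
print(words[index])
```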
sentence_data = [new[seed]] # use that word as the first word and starting point
current_word = new[seed]
To select a random item from a list, use random.choice, so in this case I'd use
current_word = random.choice(list(markov))
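A minimal sketch of random.choice for both the starting word and the next word, assuming a made-up chain in which every word has at least one follower:

```python
import random

# Hypothetical chain; every key has a non-empty follower list.
markov = {"the": ["cat", "mat"], "cat": ["sat"], "sat": ["on"], "on": ["the"]}

# random.choice needs an indexable sequence, so the keys are turned into a list first.
current_word = random.choice(list(markov))

# Picking a follower needs no index arithmetic either.
next_word = random.choice(markov[current_word])

print(current_word, next_word)
```

random.choice raises IndexError on an empty list, so a word with no followers still needs separate handling.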
while len(sentence_data) < length_sentence:
Since you know how many iterations you'll need, I'd use a for loop here.
next_index = random.randint(0, len(markov[current_word]) - 1) # randomly pick a word from the last words list.
next_word = markov[current_word][next_index]
Instead do next_word = random.choice(markov[current_word])
sentence_data.append(next_word)
current_word = next_word
return ' '.join([i for i in sentence_data])
Again, no reason to be doing this i for i dance. Just use ' '.join(sentence_data).
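Putting the suggestions together, one possible rewrite might look like the sketch below. The names `generate_sentence`, `min_words`, and `max_words` are illustrative and not from the original. Two behaviours are assumptions of this sketch rather than part of the review: it uses split() instead of split(' ') so that newlines also separate words, and it stops early when the current word has no recorded follower (the original would raise an error in that case).

```python
import random


def generate_sentence(text, min_words=15, max_words=20):
    """Build a word-level Markov chain from text and return a random sentence."""
    # Keep purely alphabetic words, lowercased; words touching punctuation are dropped.
    words = [w.lower() for w in text.split() if w.isalpha()]
    if not words:
        return ''

    # Map each word to the list of words that immediately follow it.
    markov = {w: [] for w in words}
    for before, after in zip(words, words[1:]):
        markov[before].append(after)

    current_word = random.choice(list(markov))
    sentence = [current_word]
    # The first word is already chosen, so add up to (target length - 1) more.
    for _ in range(random.randint(min_words, max_words) - 1):
        followers = markov[current_word]
        if not followers:
            # Assumption of this sketch: stop at a word with no successor.
            break
        current_word = random.choice(followers)
        sentence.append(current_word)
    return ' '.join(sentence)


sample = "the cat sat on the mat and the dog sat on the cat"
print(generate_sentence(sample))
```

Because the function takes a string, the caller decides where the text comes from, for example by reading a file and passing its contents in.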
- Thanks for taking the time to respond. Your markups will be very helpful. (tijko, Mar 23, 2013 at 18:02)
- It's a bit difficult to figure out which comment belongs to which code snippet (above or below?). Also sometimes I think you wanted to have two separate code snippets, but they were merged because there was no text in between. (mkrieger1, Jun 2, 2015 at 11:59)