For learning purposes, I'm trying to implement a Markov chain from scratch in Python.
The goal is, given a file with a list of words and some sequence, to predict the next letter according to the probabilities computed from the list of words.
My questions are not about Python syntax (for instance, I know that I could do better than `defaultdict` with `Counter` in some places, but I don't care about that here), but about the Markov chain algorithm itself, and in particular:
- is this a correct/efficient way to implement a Markov chain?
- if not, how could this algorithm be made more efficient?
- is inserting `state_length - 1` blanks before each word a good practice to allow searching for sequences shorter than `state_length`?
- is the way of representing `<EXIT>` a good practice?
The file `mdp_sequences.txt` goes like this:

```
HELLO
CELLO
HELL
HELLO
HELL
SHELL
YELLOW
HELLO
YELLOW
...
```
1. Parsing
The first part of the program reads the file and computes a count of all words:
```python
from collections import Counter, defaultdict
from pprint import pprint

states = []
state_length = 4

# read and parse csv file into a counted dict of sequences
sequences = defaultdict(int)
with open('mdp_sequences.txt') as f:
    for sequence in f:
        seq = " " * (state_length - 1) + sequence.rstrip()
        sequences[seq] += 1
```
The resulting `sequences` dict is just a count of all words found in `mdp_sequences.txt`, with the particularity that it has `state_length - 1` spaces inserted before each word:
```python
sequences = {
    '   HELLO': 342,
    '   CELLO': 117,
    '   HELL': 200,
    '   SHELL': 120,
    '   YELLOW': 250,
    ...
}
```
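To see concretely why this left-padding helps (a minimal sketch of mine, not part of the post's code): padding a query the same way as the stored words turns even a one-letter input into a valid `state_length`-character state.

```python
state_length = 4
word = "HELLO"
key = " " * (state_length - 1) + word        # '   HELLO', as stored above

query = "H"                                   # shorter than state_length
padded = " " * (state_length - 1) + query     # '   H'
state = padded[-state_length:]                # the 4-character state '   H'

assert key.startswith(state)                  # the short query maps to a state
```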
2. Enumerating states
Next we enumerate each possible state:
```python
for seq, count in sequences.items():
    for i in range(len(seq) - state_length + 1):
        for j in range(int(count)):
            states.append(seq[i:][:state_length])

counted_states = dict(Counter(states))
```
At this point, the `counted_states` dict goes like:
```python
counted_states = {
    '   H': 542,
    '  HE': 542,
    ' HEL': 542,
    'HELL': 662,
    '   C': 117,
    '  CE': 117,
    ' CEL': 117,
    'CELL': 117,
    'ELLO': 709,
    ...
}
```
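As an efficiency side note: the inner `for j in range(int(count))` loop materialises every occurrence of every state as a list entry before counting. A sketch of an alternative (toy input values assumed, same shape as the `sequences` dict above) that adds each word's whole count at once to a `Counter`:

```python
from collections import Counter

state_length = 4
# Assumed toy input in the same shape as the post's `sequences` dict,
# padded with state_length - 1 spaces as in the post.
sequences = {'   HELLO': 342, '   CELLO': 117, '   HELL': 200}

counted_states = Counter()
for seq, count in sequences.items():
    for i in range(len(seq) - state_length + 1):
        # add the whole count at once instead of appending `count` copies
        counted_states[seq[i:i + state_length]] += count

print(counted_states['   H'])  # 342 + 200 = 542
```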
3. Building the Markov Chain
Now we build the Markov Chain itself:
```python
mc = defaultdict(lambda: defaultdict(float))
for state, scount in counted_states.items():
    for seq, wcount in sequences.items():
        if state in seq:
            for i in range(len(seq) - state_length + 1):
                if seq[i:][:state_length] == state:
                    next_step = seq[i+state_length:][:1]
                    if next_step == '':
                        next_state = '<EXIT>'
                    else:
                        next_state = seq[i+1:][:state_length]
                    probability = float(wcount) / float(scount)
                    mc[state][next_state] += probability
```
The `mc` dict is a mapping from each state to its possible following states, with the probability of each possible following state:
```python
mc = {
    '   H': {
        '  HE': 1.0
    },
    '  HE': {
        ' HEL': 1.0
    },
    ' HEL': {
        'HELL': 1.0
    },
    'HELL': {
        'ELLO': 0.5167,
        '<EXIT>': 0.4833
    },
    'ELLO': {
        'LLOW': 0.3526,
        '<EXIT>': 0.6474
    },
    'LLOW': {
        '<EXIT>': 1.0
    },
    ...
}
```
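A useful sanity check on a chain like this (a sketch of mine, over a hand-copied fragment of the `mc` dict above): the outgoing transition probabilities of every state should sum to 1.

```python
import math

# Hand-copied fragment of the `mc` dict above (rounded probabilities).
mc = {
    'HELL': {'ELLO': 0.5167, '<EXIT>': 0.4833},
    'ELLO': {'LLOW': 0.3526, '<EXIT>': 0.6474},
    'LLOW': {'<EXIT>': 1.0},
}

for state, successors in mc.items():
    total = sum(successors.values())
    # tolerance allows for the rounding in the displayed values
    assert math.isclose(total, 1.0, abs_tol=1e-6), (state, total)
print("all transition rows sum to 1")
```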
4. Testing
Finally, we ask the user for a particular sequence, and list the most probable following states, considering only the last `state_length` characters of the input:
```python
user_input = input("Enter the current sequence: ")
user_input = " " * (state_length - 1) + user_input
lookup = user_input[-state_length:]
print("Considering the following sequence: '{}'".format(lookup))
print("Most probable next states are:")
for k, v in sorted(mc[lookup].items(), key=lambda x: x[1], reverse=True):
    if k == '<EXIT>':
        print(' <EXIT> - p: {}'.format(round(v, 3)))
    else:
        print(' {} (next step: {}) - p: {}'.format(k, k[-1:], round(v, 3)))
```
At this point, if the user inputs a sequence longer than `state_length`, we only consider the last `state_length` letters; for instance, if the user inputs "NUTSHELL" with `state_length = 4`, we consider "HELL".

If the user inputs a sequence shorter than `state_length`, we insert spaces in front to match `state_length`; for instance, if the user inputs "HE" with `state_length = 4`, we consider "  HE".

Full example with "NUTSHELL" and `state_length = 4`:

- consider only "HELL"
- look up `mc["HELL"]`
Output for this example is:

```
"ELLO" (next step: O) - p: 0.517
<EXIT> - p: 0.483
```
Full example with "H" and `state_length = 4`:

- consider only "   H"
- look up `mc["   H"]`

Output for this example is:

```
"  HE" (next step: E) - p: 1.0
```
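Beyond prediction, a chain in this format can also be used generatively: start from a padded initial state and sample successors until `<EXIT>`. A sketch (the function name and the toy chain values are mine, chosen for illustration, not computed from the real file):

```python
import random

# Toy chain in the post's format: states map to successor states
# (or '<EXIT>') with their probabilities.  Values are illustrative.
mc = {
    '   H': {'  HE': 1.0},
    '  HE': {' HEL': 1.0},
    ' HEL': {'HELL': 1.0},
    'HELL': {'ELLO': 0.5, '<EXIT>': 0.5},
    'ELLO': {'<EXIT>': 1.0},
}

def generate_word(mc, start='   H'):
    """Random walk over the chain; each transition adds one letter."""
    state = start
    word = state.strip()
    while True:
        successors, probs = zip(*mc[state].items())
        state = random.choices(successors, weights=probs)[0]
        if state == '<EXIT>':
            return word
        word += state[-1]   # the last character of the state is new

print(generate_word(mc))  # 'HELL' or 'HELLO'
```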
1 Answer
- Code is easier to understand, test, and reuse if you divide it into functions with well-documented inputs and outputs; for example, you might choose functions `build_markov_chain` and `apply_markov_chain`.

- Instead of a `defaultdict(int)`, you could just use a `Counter`.

- There's no need to pad the words with spaces at the left — with a few tweaks to the code you can use `'H'` instead of `'   H'` and so on.

- Representing the terminal state by a special string `'<EXIT>'` is risky — what if this string occurred in your data? It would be better to use a distinguished Python object, for example `None` or a unique object:

```python
EXIT = object()  # unique object representing the end of a word
```

- The Markov chain representation is redundant — when `'ABCD'` is followed by `'BCDE'`, you know that the three letters `BCD` must be the same. So all you need to remember in the chain is the single letter `'E'`.

- It's not necessary to convert numbers to `float` before dividing:

```python
probability = float(wcount) / float(scount)
```

Instead, write:

```python
probability = wcount / scount
```

(If you are still stuck on Python 2, then use `from __future__ import division`.)

- The algorithm proceeds in three steps: (i) read the file and build the `sequences` dictionary; (ii) use the `sequences` dictionary to build the `counted_states` dictionary; and (iii) use the `counted_states` and `sequences` dictionaries to build the Markov chain. But it would be simpler to build the chain in two steps: (i) count the successors to each state as you go through the input; and (ii) convert the counts to probabilities. Like this:

```python
from collections import Counter, defaultdict

def build_markov_chain(filename='mdp_sequences.txt', n=4):
    """Read words from a file and build a Markov chain.

    Arguments:
    filename -- Name of file containing words.
    n -- Number of characters in the states.

    Returns: map from n-character states to map from successors
    (or None indicating end-of-word) to probabilities.

    """
    # Map from state to map from successor to count.
    counts = defaultdict(Counter)
    with open(filename) as f:
        for line in f:
            line = line.rstrip()
            for i in range(1, len(line)):
                counts[line[max(0, i - n):i]][line[i]] += 1
            counts[line[max(0, len(line) - n):]][None] += 1

    # Convert counts to probabilities.
    probabilities = defaultdict(lambda: defaultdict(float))
    for state, successors in counts.items():
        total = sum(successors.values())
        probabilities[state] = {s: c / total for s, c in successors.items()}
    return probabilities
```
And so:

```python
>>> from pprint import pprint
>>> pprint(build_markov_chain())
defaultdict(<function build_markov_chain.<locals>.<lambda> at 0x110697950>,
            {'C': {'E': 1.0},
             'CE': {'L': 1.0},
             'CEL': {'L': 1.0},
             'CELL': {'O': 1.0},
             'ELLO': {None: 0.6666666666666666, 'W': 0.3333333333333333},
             'H': {'E': 1.0},
             'HE': {'L': 1.0},
             'HEL': {'L': 1.0},
             'HELL': {None: 0.5, 'O': 0.5},
             'LLOW': {None: 1.0},
             'S': {'H': 1.0},
             'SH': {'E': 1.0},
             'SHE': {'L': 1.0},
             'SHEL': {'L': 1.0},
             'Y': {'E': 1.0},
             'YE': {'L': 1.0},
             'YEL': {'L': 1.0},
             'YELL': {'O': 1.0}})
```
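The answer names `apply_markov_chain` but does not define it; one possible sketch (my assumption of what it might look like, not the answerer's code) queries the successor probabilities for the state formed by the last `n` characters of the user's input:

```python
def apply_markov_chain(chain, text, n=4):
    """Return (successor, probability) pairs for the state formed by the
    last (up to n) characters of text, most probable first.  Successors
    are single letters, or None for end-of-word, matching the shape
    returned by build_markov_chain."""
    state = text[-n:]
    return sorted(chain.get(state, {}).items(),
                  key=lambda kv: kv[1], reverse=True)

# toy chain in the same shape that build_markov_chain returns
chain = {'HELL': {None: 0.5, 'O': 0.5}, 'ELLO': {None: 2 / 3, 'W': 1 / 3}}
print(apply_markov_chain(chain, 'NUTSHELL'))
```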
- I was just about to make roughly the same points in an answer. I would have just added a note on using either `str.lower()` or `str.upper()` to avoid inconsistencies between the file and the user input (or even within the file). — 301_Moved_Permanently, Dec 19, 2016 at 20:55
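The normalization the comment suggests can be a one-line helper applied both when reading the file and to the user's input; a sketch (helper name is mine):

```python
# Sketch of the normalization suggested in the comment: upper-case both
# the file's words and the user's input so 'hello' and 'HELLO' match.
def normalize(word):
    return word.strip().upper()

print(normalize('  hello\n'))  # 'HELLO'
```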