For learning purposes, I'm trying to implement a Markov chain from scratch in Python.
The goal is, given a file with a list of words and some sequence, to predict the next letter according to the probabilities computed from the list of words.
My questions are not about Python syntax (for instance, I know that I could do better than `defaultdict` with `Counter` in some places, but I don't care about that here), but about the Markov chain algorithm itself, and in particular:
- is this a correct/efficient way to implement a Markov chain?
- if not, how could this algorithm be made more efficient?
- is inserting `state_length - 1` blanks before each word a good practice to allow searching for sequences shorter than `state_length`?
- is the way of representing `<EXIT>` a good practice?
The file `mdp_sequences.txt` goes like this:

```
HELLO
CELLO
HELL
HELLO
HELL
SHELL
YELLOW
HELLO
YELLOW
...
```
1. Parsing
The first part of the program reads the file and computes a count of all words:
```python
from collections import Counter, defaultdict
from pprint import pprint

states = []
state_length = 4

# read and parse csv file into a counted dict of sequences
sequences = defaultdict(int)
with open('mdp_sequences.txt') as f:
    for sequence in f:
        seq = " " * (state_length - 1) + sequence.rstrip()
        sequences[seq] += 1
```
The resulting `sequences` dict is just a count of all words found in `mdp_sequences.txt`, with the particularity that it has `state_length - 1` spaces inserted before each word:
```python
sequences = {
    '   HELLO': 342,
    '   CELLO': 117,
    '   HELL': 200,
    '   SHELL': 120,
    '   YELLOW': 250,
    ...
}
```
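To see concretely why this left-padding helps (a minimal sketch of mine, not part of the post's code): padding a query the same way as the stored words turns even a one-letter input into a valid `state_length`-character state.

```python
state_length = 4
word = "HELLO"
key = " " * (state_length - 1) + word        # '   HELLO', as stored above

query = "H"                                   # shorter than state_length
padded = " " * (state_length - 1) + query     # '   H'
state = padded[-state_length:]                # the 4-character state '   H'

assert key.startswith(state)                  # the short query maps to a state
```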
2. Enumerating states
Next we enumerate each possible state:
```python
for seq, count in sequences.items():
    for i in range(len(seq) - state_length + 1):
        for j in range(int(count)):
            states.append(seq[i:][:state_length])

counted_states = dict(Counter(states))
```
At this point, the `counted_states` dict goes like:
```python
counted_states = {
    '   H': 542,
    '  HE': 542,
    ' HEL': 542,
    'HELL': 662,
    '   C': 117,
    '  CE': 117,
    ' CEL': 117,
    'CELL': 117,
    'ELLO': 709,
    ...
}
```
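As an efficiency side note: the inner `for j in range(int(count))` loop materialises every occurrence of every state as a list entry before counting. A sketch of an alternative (toy input values assumed, same shape as the `sequences` dict above) that adds each word's whole count at once to a `Counter`:

```python
from collections import Counter

state_length = 4
# Assumed toy input in the same shape as the post's `sequences` dict,
# padded with state_length - 1 spaces as in the post.
sequences = {'   HELLO': 342, '   CELLO': 117, '   HELL': 200}

counted_states = Counter()
for seq, count in sequences.items():
    for i in range(len(seq) - state_length + 1):
        # add the whole count at once instead of appending `count` copies
        counted_states[seq[i:i + state_length]] += count

print(counted_states['   H'])  # 342 + 200 = 542
```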
3. Building the Markov Chain
Now we build the Markov Chain itself:
```python
mc = defaultdict(lambda: defaultdict(float))
for state, scount in counted_states.items():
    for seq, wcount in sequences.items():
        if state in seq:
            for i in range(len(seq) - state_length + 1):
                if seq[i:][:state_length] == state:
                    next_step = seq[i+state_length:][:1]
                    if next_step == '':
                        next_state = '<EXIT>'
                    else:
                        next_state = seq[i+1:][:state_length]
                    probability = float(wcount) / float(scount)
                    mc[state][next_state] += probability
```
The `mc` dict is a mapping from each state to its possible following states, with the probability of each possible following state:
```python
mc = {
    '   H': {
        '  HE': 1.0
    },
    '  HE': {
        ' HEL': 1.0
    },
    ' HEL': {
        'HELL': 1.0
    },
    'HELL': {
        'ELLO': 0.5167,
        '<EXIT>': 0.4833
    },
    'ELLO': {
        'LLOW': 0.3526,
        '<EXIT>': 0.6474
    },
    'LLOW': {
        '<EXIT>': 1.0
    },
    ...
}
```
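A useful sanity check on a chain like this (a sketch of mine, over a hand-copied fragment of the `mc` dict above): the outgoing transition probabilities of every state should sum to 1.

```python
import math

# Hand-copied fragment of the `mc` dict above (rounded probabilities).
mc = {
    'HELL': {'ELLO': 0.5167, '<EXIT>': 0.4833},
    'ELLO': {'LLOW': 0.3526, '<EXIT>': 0.6474},
    'LLOW': {'<EXIT>': 1.0},
}

for state, successors in mc.items():
    total = sum(successors.values())
    # tolerance allows for the rounding in the displayed values
    assert math.isclose(total, 1.0, abs_tol=1e-6), (state, total)
print("all transition rows sum to 1")
```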
4. Testing
Finally, we ask the user for a particular sequence, and list the most probable following states, considering only the last `state_length` characters of the input:
```python
user_input = input("Enter the current sequence: ")
user_input = " " * (state_length - 1) + user_input
lookup = user_input[-state_length:]
print("Considering the following sequence: '{}'".format(lookup))
print("Most probable next states are:")
for k, v in sorted(mc[lookup].items(), key=lambda x: x[1], reverse=True):
    if k == '<EXIT>':
        print(' <EXIT> - p: {}'.format(round(v, 3)))
    else:
        print(' {} (next step: {}) - p: {}'.format(k, k[-1:], round(v, 3)))
```
At this point, if the user inputs a sequence longer than `state_length`, we only consider the last `state_length` letters; for instance, if the user inputs "NUTSHELL" with `state_length = 4`, we consider "HELL".

If the user inputs a sequence shorter than `state_length`, we insert spaces in front to match `state_length`; for instance, if the user inputs "HE" with `state_length = 4`, we consider "  HE".

Full example with "NUTSHELL" and `state_length = 4`:

- consider only "HELL"
- look up `mc["HELL"]`
Output for this example is:

```
"ELLO" (next step: O) - p: 0.517
<EXIT> - p: 0.483
```
Full example with "H" and `state_length = 4`:

- consider only "   H"
- look up `mc["   H"]`

Output for this example is:

```
"  HE" (next step: E) - p: 1.0
```
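Beyond prediction, a chain in this format can also be used generatively: start from a padded initial state and sample successors until `<EXIT>`. A sketch (the function name and the toy chain values are mine, chosen for illustration, not computed from the real file):

```python
import random

# Toy chain in the post's format: states map to successor states
# (or '<EXIT>') with their probabilities.  Values are illustrative.
mc = {
    '   H': {'  HE': 1.0},
    '  HE': {' HEL': 1.0},
    ' HEL': {'HELL': 1.0},
    'HELL': {'ELLO': 0.5, '<EXIT>': 0.5},
    'ELLO': {'<EXIT>': 1.0},
}

def generate_word(mc, start='   H'):
    """Random walk over the chain; each transition adds one letter."""
    state = start
    word = state.strip()
    while True:
        successors, probs = zip(*mc[state].items())
        state = random.choices(successors, weights=probs)[0]
        if state == '<EXIT>':
            return word
        word += state[-1]   # the last character of the state is new

print(generate_word(mc))  # 'HELL' or 'HELLO'
```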
1 Answer
- Code is easier to understand, test, and reuse if you divide it into functions with well-documented inputs and outputs; for example, you might choose functions `build_markov_chain` and `apply_markov_chain`.

- Instead of a `defaultdict(int)`, you could just use a `Counter`.

- There's no need to pad the words with spaces at the left — with a few tweaks to the code you can use `'H'` instead of `'   H'` and so on.

- Representing the terminal state by a special string `'<EXIT>'` is risky — what if this string occurred in your data? It would be better to use a distinguished Python object, for example `None` or a unique object:

```python
EXIT = object()  # unique object representing the end of a word
```

- The Markov chain representation is redundant — when `'ABCD'` is followed by `'BCDE'`, you know that the three letters `BCD` must be the same. So all you need to remember in the chain is the single letter `'E'`.

- It's not necessary to convert numbers to `float` before dividing:

```python
probability = float(wcount) / float(scount)
```

Instead, write:

```python
probability = wcount / scount
```

(If you are still stuck on Python 2, then use `from __future__ import division`.)

- The algorithm proceeds in three steps: (i) read the file and build the `sequences` dictionary; (ii) use the `sequences` dictionary to build the `counted_states` dictionary; and (iii) use the `counted_states` and `sequences` dictionaries to build the Markov chain. But it would be simpler to build the chain in two steps: (i) count the successors to each state as you go through the input; and (ii) convert the counts to probabilities. Like this:

```python
from collections import Counter, defaultdict

def build_markov_chain(filename='mdp_sequences.txt', n=4):
    """Read words from a file and build a Markov chain.

    Arguments:
    filename -- Name of file containing words.
    n -- Number of characters in the states.

    Returns: map from n-character states to map from successors
    (or None indicating end-of-word) to probabilities.

    """
    # Map from state to map from successor to count.
    counts = defaultdict(Counter)
    with open(filename) as f:
        for line in f:
            line = line.rstrip()
            for i in range(1, len(line)):
                counts[line[max(0, i - n):i]][line[i]] += 1
            counts[line[max(0, len(line) - n):]][None] += 1

    # Convert counts to probabilities.
    probabilities = defaultdict(lambda: defaultdict(float))
    for state, successors in counts.items():
        total = sum(successors.values())
        probabilities[state] = {s: c / total for s, c in successors.items()}
    return probabilities
```
And so:

```python
>>> from pprint import pprint
>>> pprint(build_markov_chain())
defaultdict(<function build_markov_chain.<locals>.<lambda> at 0x110697950>,
            {'C': {'E': 1.0},
             'CE': {'L': 1.0},
             'CEL': {'L': 1.0},
             'CELL': {'O': 1.0},
             'ELLO': {None: 0.6666666666666666, 'W': 0.3333333333333333},
             'H': {'E': 1.0},
             'HE': {'L': 1.0},
             'HEL': {'L': 1.0},
             'HELL': {None: 0.5, 'O': 0.5},
             'LLOW': {None: 1.0},
             'S': {'H': 1.0},
             'SH': {'E': 1.0},
             'SHE': {'L': 1.0},
             'SHEL': {'L': 1.0},
             'Y': {'E': 1.0},
             'YE': {'L': 1.0},
             'YEL': {'L': 1.0},
             'YELL': {'O': 1.0}})
```
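The answer names `apply_markov_chain` but does not define it; one possible sketch (my assumption of what it might look like, not the answerer's code) queries the successor probabilities for the state formed by the last `n` characters of the user's input:

```python
def apply_markov_chain(chain, text, n=4):
    """Return (successor, probability) pairs for the state formed by the
    last (up to n) characters of text, most probable first.  Successors
    are single letters, or None for end-of-word, matching the shape
    returned by build_markov_chain."""
    state = text[-n:]
    return sorted(chain.get(state, {}).items(),
                  key=lambda kv: kv[1], reverse=True)

# toy chain in the same shape that build_markov_chain returns
chain = {'HELL': {None: 0.5, 'O': 0.5}, 'ELLO': {None: 2 / 3, 'W': 1 / 3}}
print(apply_markov_chain(chain, 'NUTSHELL'))
```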
- I was just about to make roughly the same points in an answer. I would have just added a note on using either `str.lower()` or `str.upper()` to avoid inconsistencies between the file and the user input (or even within the file). — 301_Moved_Permanently, Dec 19, 2016 at 20:55
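The normalization the comment suggests can be a one-line helper applied both when reading the file and to the user's input; a sketch (helper name is mine):

```python
# Sketch of the normalization suggested in the comment: upper-case both
# the file's words and the user's input so 'hello' and 'HELLO' match.
def normalize(word):
    return word.strip().upper()

print(normalize('  hello\n'))  # 'HELLO'
```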