I've recently finished writing a "simple-as-possible" LDA code in Python.
The theory from which I've developed my code can be found in the book Computer Vision by Simon Prince; a free PDF (courtesy of Simon Prince) is available on his website: http://computervisionmodels.com/ - Chapter 20 covers LDA. Applied to computer vision, he uses the notation i for the number of images, m for the number of parts or topics, and w for the words.
After re-doing the code today, I've found I get the results I expect (sorting a set of words into two or more topics). I've been comparing it to other LDA code which does this more accurately, and I'm wondering if the way I've written the code is making it inefficient. Any feedback appreciated :)
import numpy as np

corpus = open('corpus3.txt').read()

#Creating dictionary of unique terms, indexed and counted
dic = {}
arr=[]
arrv=[]
for item in corpus.split():
    if item in dic:
        dic[item] += 1
    else:
        dic[item] = 1
arr = dic.keys()
arrv= dic.values()
arrid=range(0,len(arr))

#Replacing actual words in doc with the word id's
Imgvv=[]
for w in corpus.split():
    for i in arrid:
        if w == arr[i]:
            Imgvv.append(i)

Imgv = [Imgvv] # Array of (array of) words in documents (replaced with id's)
Vocab = arr #Vocab of unique terms

I = len(Imgv) #Image number
M = 2 # Part number - hardwired (supervised learning)
V = len(Vocab) #vocabulary

#Dirichlet constants
alpha=0.5
beta=0.5

#Initialise the 4 counters used in Gibbs sampling
Na = np.zeros((I, M)) + alpha # Number of words for each document, topic combo i.e 11, 12, 13 -> 21, 22, 23 array
Nb = np.zeros(I) + M*alpha # number of words in each image
Nc = np.zeros((M, V)) + beta # word count of each topic and vocabulary, times the word is in topic M and is of vocab number 1, 2, 3, etc.
Nd = np.zeros(M) + V*beta # number of words in each topic

m_w = [] #topic of the current word
m_i_w=[] # topic of the image of the word

#Filling up counters
for i,img in enumerate(Imgv):
    for w in img:
        m = np.random.randint(0,M)
        m_w.append(m)
        Na[i,m] += 1
        Nb[i] += 1
        Nc[m,w] += 1
        Nd[m] += 1
    m_i_w.append(np.array(m_w)) #creating a relationship between topic and word per doc

#Gibbs Sampling
m_i=[]
q = np.zeros(M)
for t in xrange(500): #Iterations
    for i,img in enumerate(Imgv): #in the Imgv matrix there are i documents which are arrays (img) filled with words
        m_w = m_i_w[i] #Finding topic of word
        Nab = Na[i] #Taking ith row of the Na counter (array)
        for n, w in enumerate(img): #in img there are n words of value w
            m = m_w[n] # From the initialised/appended topic-word value we draw the "guessed" topic
            Nab[m] -= 1
            Nb[i] -= 1 #In Gibbs sampling we compute for all values except the current (x,y) position
            Nc[m,w] -= 1 #So we move the counter of this position down one, compute
            Nd[m] -= 1 #And then add one back after reloading the topic for the word
            q = (Nab*(Nc[:,w]))/((Nb[i])*(Nd)) # computing topic probability
            q_new = np.random.multinomial(1, q/q.sum()).argmax() # choosing new topic based on this
            m_w[n] = q_new # assigning word to topic, replacing the guessed topic from init.
            Nab[q_new] += 1 #Putting the counters back to original value before redoing process
            Nb[i] += 1
            Nc[q_new,w] += 1
            Nd[q_new] += 1

WordDist = Nc/Nd[:, np.newaxis] # This gives us the words per topic
for m in xrange(M): #Displaying results
    for w in np.argsort(-WordDist[m])[:20]:
        print("Topic", m, Vocab[w], WordDist[m,w],arrv[w])
1 Answer
Quickly reading the Wikipedia link you provided, it sounds like your implementation differs slightly from the theory, especially in the initialization part. I'm no expert, however, and will not dig further into it, so this is just a style review:
Keep spaces consistent
As it currently stands, your spacing around = and , is not consistent, which can be quite uncomfortable to read. Some of the arithmetic is also pretty dense and would gain readability from a few spaces.
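For example, the topic-probability line from your script could breathe a little more:

q = (Nab*(Nc[:,w]))/((Nb[i])*(Nd))    # dense, as currently written
q = (Nab * Nc[:, w]) / (Nb[i] * Nd)   # same expression, easier to scan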
Use functions
You could more easily test your script, and figure out why you don't get the results you're expecting, if you split it into small functions. Reading files, indexing words, initializing the computation and performing LDA seem to be the four tasks you do, apart from pretty-printing the results.
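For instance, the skeleton of the rewrite given further down (under "Incomplete improvement") looks like this; the names are only suggestions:

def read_corpuses(*filenames):
    """Return a Counter mapping each word of the corpus files to its count."""

def compute_image(vocabulary, corpus_filename):
    """Return the list of word ids for one document."""

def init_LDA(images, M, V):
    """Initialise the four counters and draw a random topic for each word."""

def LDA(topics, *filenames):
    """Tie everything together: read, index, initialise and run Gibbs sampling."""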
Be nice to the memory
Twice in your script, you use more than three times the space needed to store 'corpus.txt': corpus, dic and corpus.split() all contain each word of 'corpus.txt', plus additional data. I don't know how big the file is, but this is only one file, and LDA allows for more (namely I, to stay within the naming of your code).
Since you're already reading through the content of the file twice, why not open the file twice and process it line by line each time, to avoid flooding the memory?
And speaking of handling more files, your code should allow that more easily. Using functions with a variable number of arguments is one way of doing it; a minimal sketch follows.
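This is essentially the read_corpuses function of the rewrite below, shown here in isolation:

from collections import Counter

def read_corpuses(*filenames):
    words = Counter()
    for corpus_file in filenames:
        with open(corpus_file) as corpus:
            # the generator streams each file line by line,
            # so no full copy of a corpus is ever kept in memory
            words.update(word for line in corpus for word in line.split())
    return words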
Constants
It is best to define them at the top of the file with an ALL_CAPS name. alpha and beta can qualify (they could also be seen as variables, since they are parameters of LDA), but more importantly the number of iterations is one of them.
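For instance, with the values from your script (these are just your current values, not tuning recommendations):

ALPHA = 0.5        # Dirichlet prior alpha
BETA = 0.5         # Dirichlet prior beta
ITERATIONS = 500   # number of Gibbs sampling sweeps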
Building data structures
To count elements in an iterable, you can use the collections.Counter class, which is a subclass of dict:
words = Counter(data.split())
is more readable and understandable than
words = {}
for w in data.split():
    try:
        words[w] = words[w] + 1
    except KeyError:
        words[w] = 1
In the same spirit, it is often recommended to use a list comprehension rather than:
data = []
for elem in other_data:
    data.append(process(elem))
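That is, the loop above collapses to:

data = [process(elem) for elem in other_data]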
Incomplete improvement
(Based on Rev 2 of the question)
import numpy as np
from collections import Counter

ALPHA = 100
BETA = 5
ITERATIONS = 1000


def read_corpuses(*filenames):
    words = Counter()
    for corpus_file in filenames:
        with open(corpus_file) as corpus:
            words.update(word for line in corpus for word in line.split())
    return words


def compute_image(vocabulary, corpus_filename):
    with open(corpus_filename) as corpus:
        return [vocabulary.index(word) for line in corpus for word in line.split()]


def init_LDA(images, M, V):
    I = len(images)
    Na = np.zeros((I, M)) + ALPHA  # Number of words for each document, topic combo i.e. 11, 12, 13 -> 21, 22, 23 array
    Nb = np.zeros(I) + M*ALPHA     # Number of words in each image
    Nc = np.zeros((M, V)) + BETA   # Word count of each topic and vocabulary entry
    Nd = np.zeros(M) + V*BETA      # Number of words in each topic

    def inner(i, w):
        m = np.random.randint(0, M)
        Na[i, m] += 1
        Nb[i] += 1
        Nc[m, w-1] += 1
        Nd[m] += 1
        return m

    return Na, Nb, Nc, Nd, [[inner(i, w) for w in image] for i, image in enumerate(images)]


def LDA(topics, *filenames):
    words = read_corpuses(*filenames)
    vocabulary = words.keys()
    images = [compute_image(vocabulary, corpus) for corpus in filenames]
    Na, Nb, Nc, Nd, topic_of_words_per_image = init_LDA(images, topics, len(vocabulary))

    # Gibbs sampling
    probabilities = np.zeros(topics)
    for _ in xrange(ITERATIONS):
        for i, image in enumerate(images):
            topic_per_word = topic_of_words_per_image[i]
            for n, w in enumerate(image):
                m = topic_per_word[n]
                Na[i, m] -= 1
                Nb[i] -= 1
                Nc[m, w-1] -= 1
                Nd[m] -= 1
                # computing topic probability
                probabilities[m] = Na[i, m] * Nc[m, w-1]/(Nb[i] * Nd[m])
                # choosing new topic based on this
                q = np.random.multinomial(1, probabilities/probabilities.sum()).argmax()
                # assigning word to topic
                topic_per_word[n] = q
                Na[i, q] += 1
                Nb[i] += 1
                Nc[q, w-1] += 1
                Nd[q] += 1

    distances = Nc/Nd[:, np.newaxis]  # Words by topic
    return distances, vocabulary, words


if __name__ == '__main__':
    topics = 2
    # Add as many filenames as needed, e.g. LDA(topics, 'corpus1.txt', 'corpus2.txt', 'corpus3.txt')
    distances, vocabulary, words_count = LDA(topics, 'corpus.txt')
    for topic in xrange(topics):
        for word_index in np.argsort(-distances[topic])[:20]:
            word = vocabulary[word_index]
            print "Topic", topic, word, distances[topic, word_index], words_count[word]
Things left to improve
- Your corpuses might contain words with punctuation and capitalization. You might want to split according to string.punctuation and lowercase each word, to avoid creating several keys for a single word. You may want to filter on the length of the string while doing so, or del words[''] afterwards, depending on how you do it; a sketch of such a cleanup follows this list.
- Naming is not optimal and hard to follow. Better to take the variable names from the LDA theory, like theta and phi instead of Na and Nc. The same goes for loop variables; some are not intuitive enough to let a reader follow your implementation against the theory from your Wikipedia link.
- As said at the beginning of the post, init_LDA might not compute the correct arrays.
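For the first point, a minimal sketch of that cleanup (the tokenize helper is only a suggestion, not part of the code above):

import string

def tokenize(line):
    """Lowercase a line and strip punctuation before splitting it into words."""
    cleaned = ''.join(c for c in line.lower() if c not in string.punctuation)
    # split() with no argument already discards empty strings
    return cleaned.split()

It could then replace the bare line.split() calls in read_corpuses and compute_image.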
Comment from Fred (Nov 3, 2015): Hey, wow - thanks a lot for all the info. I will try to implement this tomorrow and see how it goes. I wasn't using Wikipedia to build my program; the link I put in the description is for the book "Computer Vision", which tackles different methods of image classification and goes over LDA, which is why my nomenclature is different. I've tried to avoid functions because I found them difficult to debug, versus another part where you can just ## it out, but I should probably adapt :p