I've recently finished writing a "simple-as-possible" LDA code in Python.
The theory from which I've developed my code can be found in the book Computer Vision by Simon Prince; a free PDF (courtesy of Simon Prince) is available on his website: http://computervisionmodels.com/ - Chapter 20 covers LDA. Applied to computer vision, he uses the notation i for the number of images, m for the number of parts or topics, and w for the words.
After re-doing the code today, I've found I get the results I expect (sorting a set of words into two or more topics). I've been comparing it to other LDA code which does this more accurately, and I'm wondering if the way I've written the code is making it inefficient. Any feedback appreciated :)
import numpy as np

corpus = open('corpus3.txt').read()

#Creating dictionary of unique terms, indexed and counted
dic = {}
arr=[]
arrv=[]
for item in corpus.split():
    if item in dic:
        dic[item] += 1
    else:
        dic[item] = 1
arr = dic.keys()
arrv= dic.values()
arrid=range(0,len(arr))

#Replacing actual words in doc with the word id's
Imgvv=[]
for w in corpus.split():
    for i in arrid:
        if w == arr[i]:
            Imgvv.append(i)

Imgv = [Imgvv] # Array of (array of) words in documents (replaced with id's)
Vocab = arr #Vocab of unique terms

I = len(Imgv) #Image number
M = 2 # Part number - hardwired (supervised learning)
V = len(Vocab) #vocabulary

#Dirichlet constants
alpha=0.5
beta=0.5

#Initialise the 4 counters used in Gibbs sampling
Na = np.zeros((I, M)) + alpha # Number of words for each document, topic combo i.e 11, 12, 13 -> 21, 22, 23 array
Nb = np.zeros(I) + M*alpha # number of words in each image
Nc = np.zeros((M, V)) + beta # word count of each topic and vocabulary, times the word is in topic M and is of vocab number 1, 2, 3, etc.
Nd = np.zeros(M) + V*beta # number of words in each topic

m_w = [] #topic of the current word
m_i_w=[] # topic of the image of the word

#Filling up counters
for i,img in enumerate(Imgv):
    for w in img:
        m = np.random.randint(0,M)
        m_w.append(m)
        Na[i,m] += 1
        Nb[i] += 1
        Nc[m,w] += 1
        Nd[m] += 1
    m_i_w.append(np.array(m_w)) #creating a relationship between topic and word per doc

#Gibbs Sampling
m_i=[]
q = np.zeros(M)
for t in xrange(500): #Iterations
    for i,img in enumerate(Imgv): #in the Imgv matrix there are i documents which are arrays (img) filled with words
        m_w = m_i_w[i] #Finding topic of word
        Nab = Na[i] #Taking ith row of the Na counter (array)
        for n, w in enumerate(img): #in img there are n words of value w
            m = m_w[n] # From the initialised/appended topic-word value we draw the "guessed" topic
            Nab[m] -= 1
            Nb[i] -= 1 #In Gibbs sampling we compute for all values except the current (x,y) position
            Nc[m,w] -= 1 #So we move the counter of this position down one, compute
            Nd[m] -= 1 #And then add one back after reloading the topic for the word
            q = (Nab*(Nc[:,w]))/((Nb[i])*(Nd)) # computing topic probability
            q_new = np.random.multinomial(1, q/q.sum()).argmax() # choosing new topic based on this
            m_w[n] = q_new # assigning word to topic, replacing the guessed topic from init.
            Nab[q_new] += 1 #Putting the counters back to original value before redoing process
            Nb[i] += 1
            Nc[q_new,w] += 1
            Nd[q_new] += 1

WordDist = Nc/Nd[:, np.newaxis] # This gives us the words per topic
for m in xrange(M): #Displaying results
    for w in np.argsort(-WordDist[m])[:20]:
        print("Topic", m, Vocab[w], WordDist[m,w],arrv[w])
1 Answer
Quickly reading the Wikipedia link you provided, it sounds like your implementation differs slightly from the theory, especially in the initialization part. I'm no expert, however, and will not dig further into it, so this is just a style review:
Keep spaces consistent
As it currently stands, your spacing around = and , is not consistent, which can be quite uncomfortable to read. Some of the arithmetic is also pretty dense and would gain readability from a few spaces.
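For example, the topic-probability line from your script could breathe a little more:

q = (Nab*(Nc[:,w]))/((Nb[i])*(Nd))    # dense, as currently written
q = (Nab * Nc[:, w]) / (Nb[i] * Nd)   # same expression, easier to scan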
Use functions
You could more easily test your script, and figure out why you don't get the results you're expecting, if you split it into small functions. Reading files, indexing words, initializing the computation and performing LDA seem to be the four tasks you do, apart from pretty-printing the results.
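For instance, the skeleton of the rewrite given further down (under "Incomplete improvement") looks like this; the names are only suggestions:

def read_corpuses(*filenames):
    """Return a Counter mapping each word of the corpus files to its count."""

def compute_image(vocabulary, corpus_filename):
    """Return the list of word ids for one document."""

def init_LDA(images, M, V):
    """Initialise the four counters and draw a random topic for each word."""

def LDA(topics, *filenames):
    """Tie everything together: read, index, initialise and run Gibbs sampling."""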
Be nice to the memory
Twice in your script, you use more than three times the space needed to store 'corpus.txt': corpus, dic and corpus.split() all contain each word of 'corpus.txt', plus additional data. I don't know how big the file is, but this is only one file, and LDA allows for more (namely I, to stay within the naming of your code).
Since you're already reading through the content of the file twice, why not open the file twice and process it line by line each time, to avoid flooding the memory?
And speaking of handling more files, your code should allow that more easily. Using functions with a variable number of arguments is one way of doing it; a minimal sketch follows.
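This is essentially the read_corpuses function of the rewrite below, shown here in isolation:

from collections import Counter

def read_corpuses(*filenames):
    words = Counter()
    for corpus_file in filenames:
        with open(corpus_file) as corpus:
            # the generator streams each file line by line,
            # so no full copy of a corpus is ever kept in memory
            words.update(word for line in corpus for word in line.split())
    return words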
Constants
It is best to define them at the top of the file with an ALL_CAPS name. alpha and beta can qualify (they could also be seen as variables, since they are parameters of LDA), but more importantly the number of iterations is one of them.
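For instance, with the values from your script (these are just your current values, not tuning recommendations):

ALPHA = 0.5        # Dirichlet prior alpha
BETA = 0.5         # Dirichlet prior beta
ITERATIONS = 500   # number of Gibbs sampling sweeps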
Building data structures
To count elements in an iterable, you can use the collections.Counter class, which is a subclass of dict:
words = Counter(data.split())
is more readable and understandable than
words = {}
for w in data.split():
    try:
        words[w] = words[w] + 1
    except KeyError:
        words[w] = 1
In the same spirit, it is often recommended to use a list comprehension rather than:
data = []
for elem in other_data:
    data.append(process(elem))
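That is, the loop above collapses to:

data = [process(elem) for elem in other_data]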
Incomplete improvement
(Based on Rev 2 of the question)
import numpy as np
from collections import Counter

ALPHA = 100
BETA = 5
ITERATIONS = 1000


def read_corpuses(*filenames):
    words = Counter()
    for corpus_file in filenames:
        with open(corpus_file) as corpus:
            words.update(word for line in corpus for word in line.split())
    return words


def compute_image(vocabulary, corpus_filename):
    with open(corpus_filename) as corpus:
        return [vocabulary.index(word) for line in corpus for word in line.split()]


def init_LDA(images, M, V):
    I = len(images)
    Na = np.zeros((I, M)) + ALPHA  # Number of words for each document, topic combo i.e. 11, 12, 13 -> 21, 22, 23 array
    Nb = np.zeros(I) + M*ALPHA     # Number of words in each image
    Nc = np.zeros((M, V)) + BETA   # Word count of each topic and vocabulary entry
    Nd = np.zeros(M) + V*BETA      # Number of words in each topic

    def inner(i, w):
        m = np.random.randint(0, M)
        Na[i, m] += 1
        Nb[i] += 1
        Nc[m, w-1] += 1
        Nd[m] += 1
        return m

    return Na, Nb, Nc, Nd, [[inner(i, w) for w in image] for i, image in enumerate(images)]


def LDA(topics, *filenames):
    words = read_corpuses(*filenames)
    vocabulary = words.keys()
    images = [compute_image(vocabulary, corpus) for corpus in filenames]
    Na, Nb, Nc, Nd, topic_of_words_per_image = init_LDA(images, topics, len(vocabulary))

    # Gibbs sampling
    probabilities = np.zeros(topics)
    for _ in xrange(ITERATIONS):
        for i, image in enumerate(images):
            topic_per_word = topic_of_words_per_image[i]
            for n, w in enumerate(image):
                m = topic_per_word[n]
                Na[i, m] -= 1
                Nb[i] -= 1
                Nc[m, w-1] -= 1
                Nd[m] -= 1
                # computing topic probability
                probabilities[m] = Na[i, m] * Nc[m, w-1]/(Nb[i] * Nd[m])
                # choosing new topic based on this
                q = np.random.multinomial(1, probabilities/probabilities.sum()).argmax()
                # assigning word to topic
                topic_per_word[n] = q
                Na[i, q] += 1
                Nb[i] += 1
                Nc[q, w-1] += 1
                Nd[q] += 1

    distances = Nc/Nd[:, np.newaxis]  # Words by topic
    return distances, vocabulary, words


if __name__ == '__main__':
    topics = 2
    # Add as many filenames as needed, e.g. LDA(topics, 'corpus1.txt', 'corpus2.txt', 'corpus3.txt')
    distances, vocabulary, words_count = LDA(topics, 'corpus.txt')
    for topic in xrange(topics):
        for word_index in np.argsort(-distances[topic])[:20]:
            word = vocabulary[word_index]
            print "Topic", topic, word, distances[topic, word_index], words_count[word]
Things left to improve
- Your corpuses might contain words with punctuation and capitalization. You might want to split according to string.punctuation and lowercase each word, to avoid creating several keys for a single word. You may want to filter on the length of the string while doing so, or del words[''] afterwards, depending on how you do it; a sketch of such a cleanup follows this list.
- Naming is not optimal and hard to follow. Better to take the variable names from the LDA theory, like theta and phi instead of Na and Nc. The same goes for loop variables; some are not intuitive enough to let a reader follow your implementation against the theory from your Wikipedia link.
- As said at the beginning of the post, init_LDA might not compute the correct arrays.
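For the first point, a minimal sketch of that cleanup (the tokenize helper is only a suggestion, not part of the code above):

import string

def tokenize(line):
    """Lowercase a line and strip punctuation before splitting it into words."""
    cleaned = ''.join(c for c in line.lower() if c not in string.punctuation)
    # split() with no argument already discards empty strings
    return cleaned.split()

It could then replace the bare line.split() calls in read_corpuses and compute_image.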
Comment from Fred (Nov 3, 2015): Hey, wow - thanks a lot for all the info. I will try to implement this tomorrow and see how it goes. I wasn't using Wikipedia to build my program; the link I put in the description is for the book "Computer Vision", which tackles different methods of image classification and goes over LDA, which is why my nomenclature is different. I've tried to avoid functions because I found them difficult to debug, versus another part where you can just ## it out, but I should probably adapt :p