I have a function that takes two separate wordlists and searches a third list, which is a text formatted as a list of wordlists.
The function finds the proximity between words in word_list1
and word_list2
by taking the difference between their indexes; (it takes one over the difference, so that larger numbers will indicate closer proximity).
I ultimately will write the output to a .csv file and create a network of the word proximities in gephi.
This function works for me, but it is very slow when used on a large number of texts. Do you have any suggestions for making it more efficient? (If this is unclear at all, let me know, and I will try to clarify.)
text = [
'This, Reader, is the entertainment of those who let loose their own thoughts, and follow them in writing, which thou oughtest not to envy them, since they afford thee an opportunity of the like diversion if thou wilt make use of thy own thoughts in reading.',
'For the understanding, like the eye, judging of objects only by its own sight, cannot but be pleased with what it discovers, having less regret for what has escaped it, because it is unknown.'
]
word_list1 = ['entertainment', 'follow', 'joke', 'understanding']
word_list2 = ['envy', 'use', 'nada']
text_split = []
for line in text:
text_split.append(line.split(' '))
def word_relations(list_a, list_b, text):
relations = []
for line in text:
for i, item in enumerate(line):
for w in list_a:
if w in item:
first_int = i
first_word = w
for t, item in enumerate(line):
for x in list_b:
if x in item:
second_int = t
second_word = x
if first_int:
if second_int != first_int:
dist = 1.0 / abs(second_int-first_int)
if dist in relations:
continue
else:
relations.append((first_word,
second_word, dist))
return(relations)
print(word_relations(word_list1, word_list2, text_split))
Here is the output:
[('entertainment', 'envy', 0.05263157894736842), ('entertainment', 'use', 0.02857142857142857), ('follow', 'envy', 0.1111111111111111), ('follow', 'use', 0.04), ('understanding', 'use', 0.03571428571428571)]
1 Answer 1
Algorithm
You enumerate items
within a loop also enumerating items
. This means your algorithm is quadratic in the length of each sentence, which is bad. I think you can improve your algorithm to make only a single pass over items
by creating a dict
which stores unique words as keys, and lists of word indices as the values. Then you can lookup the appropriate indices of the items in your wordlists and perform the distance calculation. Since dict
lookups are a constant-time operation, this reduces the complexity to linear in the length of each sentence. Do note that the algorithm is still quadratic in the length of your word lists, so there may be some improvement to be had if your word lists are long.
Correctness and Edge Cases
It's hard to tell exactly what this code is supposed to do, so I will be making a few assumptions. The handling of edge cases will vary depending on the requirements.
You likely have at least one bug in your code, which is reflected in your example output: you check if x in item
, which will evaluate True
for the string 'use'
in the word 'because'
. If this is not the desired behavior, you may want a stricter check like checking equality x == item
, or something based on Levenshtein distance for a less strict evaluation.
Another possible bug is that you never include the first word of a sentence in your results. Your check if first_int:
will be False
for every word whose index is 0
.
Code Style
Holy indentation, Batman! Deeply nested code is hard to read and understand, and usually indicates you can organize your code better. Usually the inner loops can be brought into their own function. You can sometimes reduce nesting by consolidating conditional statements. For example, an if
statement immediately followed by another if
with no else
can be brought onto one line:
if first_int: if second_int != first_int:
can be written on one line as
if first_int and second_int != first_int:
Short variable names like w
, t
, and x
aren't very descriptive, and make it hard for others to understand the code. Try to pick more descriptive names.
Make sure you don't include unnecessary logic. For example, your check if dist in relations
will always be False
, since you only insert tuples, and dist
is a float. It can be removed, saving you a line of code and a level of indentation.
texts is undefined
\$\endgroup\$