Using two wordlists to search a list of texts

Question 1

I have a function that takes two separate wordlists and searches a third list, which is a text formatted as a list of wordlists.

The function finds the proximity between words in word_list1 and word_list2 by taking the difference between their indexes; (it takes one over the difference, so that larger numbers will indicate closer proximity).

I ultimately will write the output to a .csv file and create a network of the word proximities in gephi.

This function works for me, but it is very slow when used on a large number of texts. Do you have any suggestions for making it more efficient? (If this is unclear at all, let me know, and I will try to clarify.)

text = [
 'This, Reader, is the entertainment of those who let loose their own thoughts, and follow them in writing, which thou oughtest not to envy them, since they afford thee an opportunity of the like diversion if thou wilt make use of thy own thoughts in reading.',
 'For the understanding, like the eye, judging of objects only by its own sight, cannot but be pleased with what it discovers, having less regret for what has escaped it, because it is unknown.'
]
word_list1 = ['entertainment', 'follow', 'joke', 'understanding']
word_list2 = ['envy', 'use', 'nada']
text_split = []
for line in text: 
 text_split.append(line.split(' '))
def word_relations(list_a, list_b, text): 
 relations = []
 for line in text:
 for i, item in enumerate(line): 
 for w in list_a:
 if w in item: 
 first_int = i
 first_word = w
 for t, item in enumerate(line):
 for x in list_b:
 if x in item: 
 second_int = t
 second_word = x
 if first_int:
 if second_int != first_int: 
 dist = 1.0 / abs(second_int-first_int)
 if dist in relations: 
 continue 
 else: 
 relations.append((first_word,
 second_word, dist))
 return(relations)
print(word_relations(word_list1, word_list2, text_split))

Here is the output:

[('entertainment', 'envy', 0.05263157894736842), ('entertainment', 'use', 0.02857142857142857), ('follow', 'envy', 0.1111111111111111), ('follow', 'use', 0.04), ('understanding', 'use', 0.03571428571428571)]

Question 2

Please give us example texts for easier testing. I tells me texts is undefined

Question 3

Sorry, somehow that got edited. It should be just 'text.'

Question 4

I have been testing with that 'text' variable. When I use the function on multiple texts, they are also stored as lists of sentences, but they are composed of several thousand sentences.

Question 5

Algorithm

You enumerate items within a loop also enumerating items. This means your algorithm is quadratic in the length of each sentence, which is bad. I think you can improve your algorithm to make only a single pass over items by creating a dict which stores unique words as keys, and lists of word indices as the values. Then you can lookup the appropriate indices of the items in your wordlists and perform the distance calculation. Since dict lookups are a constant-time operation, this reduces the complexity to linear in the length of each sentence. Do note that the algorithm is still quadratic in the length of your word lists, so there may be some improvement to be had if your word lists are long.

Correctness and Edge Cases

It's hard to tell exactly what this code is supposed to do, so I will be making a few assumptions. The handling of edge cases will vary depending on the requirements.

You likely have at least one bug in your code, which is reflected in your example output: you check if x in item, which will evaluate True for the string 'use' in the word 'because'. If this is not the desired behavior, you may want a stricter check like checking equality x == item, or something based on Levenshtein distance for a less strict evaluation.

Another possible bug is that you never include the first word of a sentence in your results. Your check if first_int: will be False for every word whose index is 0.

Code Style

Holy indentation, Batman! Deeply nested code is hard to read and understand, and usually indicates you can organize your code better. Usually the inner loops can be brought into their own function. You can sometimes reduce nesting by consolidating conditional statements. For example, an if statement immediately followed by another if with no else can be brought onto one line:

if first_int: if second_int != first_int:

can be written on one line as

if first_int and second_int != first_int:

Short variable names like w, t, and x aren't very descriptive, and make it hard for others to understand the code. Try to pick more descriptive names.

Make sure you don't include unnecessary logic. For example, your check if dist in relations will always be False, since you only insert tuples, and dist is a float. It can be removed, saving you a line of code and a level of indentation.

Aurelius AureliusAurelius 1,3688 silver badges21 bronze badges · Accepted Answer · 2015-04-21 05:46:10Z

Algorithm

You enumerate items within a loop also enumerating items. This means your algorithm is quadratic in the length of each sentence, which is bad. I think you can improve your algorithm to make only a single pass over items by creating a dict which stores unique words as keys, and lists of word indices as the values. Then you can lookup the appropriate indices of the items in your wordlists and perform the distance calculation. Since dict lookups are a constant-time operation, this reduces the complexity to linear in the length of each sentence. Do note that the algorithm is still quadratic in the length of your word lists, so there may be some improvement to be had if your word lists are long.

Correctness and Edge Cases

It's hard to tell exactly what this code is supposed to do, so I will be making a few assumptions. The handling of edge cases will vary depending on the requirements.

You likely have at least one bug in your code, which is reflected in your example output: you check if x in item, which will evaluate True for the string 'use' in the word 'because'. If this is not the desired behavior, you may want a stricter check like checking equality x == item, or something based on Levenshtein distance for a less strict evaluation.

Another possible bug is that you never include the first word of a sentence in your results. Your check if first_int: will be False for every word whose index is 0.

Code Style

Holy indentation, Batman! Deeply nested code is hard to read and understand, and usually indicates you can organize your code better. Usually the inner loops can be brought into their own function. You can sometimes reduce nesting by consolidating conditional statements. For example, an if statement immediately followed by another if with no else can be brought onto one line:

if first_int: if second_int != first_int:

can be written on one line as

if first_int and second_int != first_int:

Short variable names like w, t, and x aren't very descriptive, and make it hard for others to understand the code. Try to pick more descriptive names.

Make sure you don't include unnecessary logic. For example, your check if dist in relations will always be False, since you only insert tuples, and dist is a float. It can be removed, saving you a line of code and a level of indentation.

Stack Exchange Network

Using two wordlists to search a list of texts

1 Answer 1

Algorithm

Correctness and Edge Cases

Code Style

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Using two wordlists to search a list of texts

1 Answer 1

Algorithm

Correctness and Edge Cases

Code Style

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions