5
\$\begingroup\$

I have a large list of lists, where each list contains words/characters as elements, and each list is of different length and may contain same words/characters more than once, as for example:

list_ = [['black', 'window', 'says', 'window'], ['dog', 'great'], ['after', 'end', 'explains', 'income', '.', '?']] 

I also have a large dictionary of words/characters like this:

dictionary = {'word1': 'RARE', 'word2': 'RARE', 'word3': 'RARE', '%': 'RARE'} etc.

The length of list_ is around 40 000 (the total number of words in all lists is about 1 mln) and the length of dictionary is around 31 000. For each word in dictionary, I need to check if it's in any of the lists of list_ and if it is, I need to replace it with 'RARE'.

For example, if 'black'== 'word1' and 'after' == word3, the final list_ will be:

list_ = [['RARE', 'window', 'says'], ['dog', 'says'], ['RARE', 'end', 'explains', 'income']]

Because of the large length of list_ and dictionary, the code that I currently have below takes almost 40min to run. I'm new to Python and I would appreciate if someone could suggest whether there's a way to do this task much faster/more efficiently.

def replace_all(word_list, dictionary):
 for i, j in dictionary.iteritems():
 for lst in word_list: 
 if i in set(lst): 
 lst[lst.index(i)] = j
replace_all(list_, dictionary) 
Scimonster
3261 silver badge10 bronze badges
asked Feb 7, 2015 at 20:32
\$\endgroup\$

1 Answer 1

2
\$\begingroup\$

You're looping in incorrect order, take the advantage of the fact that dictionary provides O(1) lookups. So, you should loop over the list and replace its items with the value present in the dictionary. This way you will loop over the list only once:

We can either use dict.get here and avoid an if condition:

def replace_matched_items(word_list, dictionary):
 for lst in word_list:
 for ind, item in enumerate(lst):
 lst[ind] = dictionary.get(item, item)

Or check for the key first and modify only matched item(LBYL):

def replace_matched_items(word_list, dictionary):
 for lst in word_list:
 for ind, item in enumerate(lst):
 if item in dictionary:
 lst[ind] = dictionary[item]

or if you prefer EAFP then try to catch the KeyError:

def replace_matched_items(word_list, dictionary):
 for lst in word_list:
 for ind, item in enumerate(lst):
 try:
 lst[ind] = dictionary[item]
 except KeyError:
 pass 

I generally find functions that modify a mutable item in-place and don't return anything confusing, it's usually a better idea to return the list object from the function and re-assign to the variable from where it was called.

def replace_matched_items(word_list, dictionary):
 for lst in word_list:
 for ind, item in enumerate(lst):
 lst[ind] = dictionary.get(item, item)
list_ = replace_matched_items(list_, dictionary)

But the above approach is still flawed(at least when our object has multiple references), with this even though we re-assigned only a single variable all the other references to the list object will still be affected, which can be very confusing to users. For example:

def func(seq):
 for lst in seq:
 for i, x in enumerate(lst):
 if x == 1:
 lst[i] = 100
 return seq
foo = [[1, 1, 2], [1, 2, 3]]
dct = {'key': foo}
foo = func(foo)
print foo
print dct

Outputs:

[[100, 100, 2], [100, 2, 3]]
{'key': [[100, 100, 2], [100, 2, 3]]}

To fix such such cases we can either pass a deepcopy of the list to the function(or simple shallow copy for list containing only immutable items) or we can create a new list inside of the function. But as we are dealing with 1 million items creating another list will surely result in extra memory consumption. Here's an example of how to do this by creating a new list using dict.get:

def replace_matched_items(word_list, dictionary):
 new_list = [[dictionary.get(item, item) for item in lst] for lst in word_list]
 return new_list
list_ = replace_matched_items(list_, dictionary)

Few other questions on the same topic:

answered Feb 7, 2015 at 20:48
\$\endgroup\$
2
  • \$\begingroup\$ Ashwini Chaudhary, thanks a lot for all your suggestions, this was incredibly helpful! I tried it and your code takes only a sec to run! \$\endgroup\$ Commented Feb 7, 2015 at 21:35
  • \$\begingroup\$ On a related note, if you are looking for in-place replacements within a list, take a look at this answer \$\endgroup\$ Commented Nov 23, 2016 at 3:37

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.