Need a fast implementation of a one-to-many dict mapping. The task is to convert candidates produced by a generator to their corresponding indices.
The generator produces several word candidates at each position of a given sentence by looking ahead n words and checking whether any of them appear in the lexicon.
Input:
tokens = ['by', 'and', 'large', 'multiword', 'expression', 'are', 'a', 'pain', 'in', 'the', 'neck']
Word index dictionary, which records the position of each word in the sentence:
SlicableDict([(0, 'by'), (1, 'and'), (2, 'large'), (3, 'multiword'), (4, 'expression'), (5, 'are'), (6, 'a'), (7, 'pain'), (8, 'in'), (9, 'the'), (10, 'neck')])
Output:
#candidates
[[('by', 'and'), ('by', 'and', 'large'), ('by', 'and', 'large', 'multiword')], [('and', 'large'), ('and', 'large', 'multiword'), ('and', 'large', 'multiword', 'expression')], [('large', 'multiword'), ('large', 'multiword', 'expression'), ('large', 'multiword', 'expression', 'are')], [('multiword', 'expression'), ('multiword', 'expression', 'are'), ('multiword', 'expression', 'are', 'a')], [('expression', 'are'), ('expression', 'are', 'a'), ('expression', 'are', 'a', 'pain')], [('are', 'a'), ('are', 'a', 'pain'), ('are', 'a', 'pain', 'in')], [('a', 'pain'), ('a', 'pain', 'in'), ('a', 'pain', 'in', 'the')], [('pain', 'in'), ('pain', 'in', 'the'), ('pain', 'in', 'the', 'neck')], [('in', 'the'), ('in', 'the', 'neck')], [('the', 'neck')], []]
# indices
[[[0, 1], [0, 1, 2], [0, 1, 2, 3]], [[1, 2], [1, 2, 3], [1, 2, 3, 4]], [[2, 3], [2, 3, 4], [2, 3, 4, 5]], [[3, 4], [3, 4, 5], [3, 4, 5, 6]], [[4, 5], [4, 5, 6], [4, 5, 6, 7]], [[5, 6], [5, 6, 7], [5, 6, 7, 8]], [[6, 7], [6, 7, 8], [6, 7, 8, 9]], [[7, 8], [7, 8, 9], [7, 8, 9, 10]], [[8, 9], [8, 9, 10]], [[9, 10]], []]
Difficult case:
tokens = ['identify', 'cancer', 'as', 'well', 'as', 'doctor']
More than one 'as' appears in the sentence, so the mapping is going to be one-to-many.
output:
[[[0, 1], [0, 1, 2], [0, 1, 2, 3]], [[1, 2], [1, 2, 3], [1, 2, 3, 4]], [[2, 3], [2, 3, 4], [2, 3, 4, 5]], [[3, 4], [3, 4, 5]], [[4, 5]], []]
[[('identify', 'cancer'), ('identify', 'cancer', 'as'), ('identify', 'cancer', 'as', 'well')], [('cancer', 'as'), ('cancer', 'as', 'well'), ('cancer', 'as', 'well', 'as')], [('as', 'well'), ('as', 'well', 'as'), ('as', 'well', 'as', 'doctor')], [('well', 'as'), ('well', 'as', 'doctor')], [('as', 'doctor')], []]
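The one-to-many behaviour can be illustrated with a small standalone sketch of the reversed dict (an illustration only, not part of the generator itself):

```python
from collections import defaultdict

tokens = ['identify', 'cancer', 'as', 'well', 'as', 'doctor']

# One-to-many reversed dict: each word maps to the list of all its
# positions, so the repeated 'as' keeps both of its indices.
reversed_dict = defaultdict(list)
for index, word in enumerate(tokens):
    reversed_dict[word].append(index)

print(reversed_dict['as'])      # [2, 4]
print(reversed_dict['cancer'])  # [1]
```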
My attempt

def generate_sent_position_candidates_and_indices(self, sent, ne_size):
    word2index = SlicableDict({index: word for index, word in enumerate(sent)})
    indices = []
    pos_cands = []
    for i in range(len(sent)):
        # at each position only look at n words ahead, cut the dict
        curnt_dict = word2index[i:i + self.n]
        # one-to-many reversed dict, e.g., {word: [idx1, idx2]}
        reversed_dict = defaultdict(list)
        for key, value in curnt_dict.items():
            reversed_dict[value].append(key)
        # generate candidates at current position
        curnt_pos_cands = list(self.generate_candidates(sent[i:], ne_size))
        curnt_indices = []
        if curnt_pos_cands:
            for mwe in curnt_pos_cands:
                pool = []
                tmp = []
                for word in mwe:
                    word_index = Counter(pool)[word]
                    pool.append(word)
                    tmp.append(reversed_dict[word][word_index])
                curnt_indices.append(tmp)
        indices.append(curnt_indices)
        pos_cands.append(curnt_pos_cands)
    return indices, pos_cands
I've created a SlicableDict, built a reversed_dict at every sentence position, and maintained a pool recording the words already seen. I then use Counter to locate the right index in reversed_dict. I've tested the speed, and this is 10 times slower than the version that doesn't return indices. Is there anything I can do to improve the speed?
The implementation of SlicableDict:
# ref: https://stackoverflow.com/questions/30975339/slicing-a-python-ordereddict
from collections import OrderedDict
from itertools import islice

class SlicableDict(OrderedDict):
    def __getitem__(self, k):
        if not isinstance(k, slice):
            return OrderedDict.__getitem__(self, k)
        return SlicableDict(islice(self.items(), k.start, k.stop))
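As a quick sanity check (not from the original post), slicing returns a sub-dict of the items in that positional range:

```python
from collections import OrderedDict
from itertools import islice


class SlicableDict(OrderedDict):
    def __getitem__(self, k):
        if not isinstance(k, slice):
            return OrderedDict.__getitem__(self, k)
        # slice positionally over the ordered items
        return SlicableDict(islice(self.items(), k.start, k.stop))


# keys are positions, so positional slicing matches key slicing
d = SlicableDict(enumerate('abcde'))
print(list(d[1:3].items()))  # [(1, 'b'), (2, 'c')]
```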
Runnable code for the test:

from collections import Counter, OrderedDict, defaultdict
from itertools import islice


class Test():
    n = 6

    def __init__(self):
        self.test_case()

    def ne_generator(self, tokens, candidates, ne_size=4):
        """Named entity generator that extends generated candidates.

        Basically, it generates some word permutations relating to the 1st token.
        """
        if len(tokens) != 1:
            current_ne = (tokens[0],)
            if len(tokens) < ne_size:
                ne_size = len(tokens)
            for i in range(1, ne_size):
                current_ne += (tokens[i],)
                if current_ne not in candidates:
                    candidates.append(current_ne)
        return candidates

    def generate_candidates(self, tokens, ne_size):
        # NE generator
        candidates = self.ne_generator(tokens, [], ne_size=ne_size)
        return candidates

    def generate_sent_position_candidates_and_indices(self, sent, ne_size):
        word2index = SlicableDict({index: word for index, word in enumerate(sent)})
        indices = []
        pos_cands = []
        for i in range(len(sent)):
            # at each position only look at n words ahead, cut the dict
            curnt_dict = word2index[i:i + self.n]
            # one-to-many reversed dict, e.g., {word: [idx1, idx2]}
            reversed_dict = defaultdict(list)
            for key, value in curnt_dict.items():
                reversed_dict[value].append(key)
            # generate candidates at current position
            curnt_pos_cands = list(self.generate_candidates(sent[i:], ne_size))
            curnt_indices = []
            if curnt_pos_cands:
                for mwe in curnt_pos_cands:
                    pool = []
                    tmp = []
                    for word in mwe:
                        word_index = Counter(pool)[word]
                        pool.append(word)
                        tmp.append(reversed_dict[word][word_index])
                    curnt_indices.append(tmp)
            indices.append(curnt_indices)
            pos_cands.append(curnt_pos_cands)
        return indices, pos_cands

    def test_case(self):
        tokens = ['identify', 'cancer', 'as', 'well', 'as', 'doctor']
        a, b = self.generate_sent_position_candidates_and_indices(tokens, 4)
        assert a == [[[0, 1], [0, 1, 2], [0, 1, 2, 3]],
                     [[1, 2], [1, 2, 3], [1, 2, 3, 4]],
                     [[2, 3], [2, 3, 4], [2, 3, 4, 5]],
                     [[3, 4], [3, 4, 5]], [[4, 5]], []]
        assert b == [[('identify', 'cancer'), ('identify', 'cancer', 'as'),
                      ('identify', 'cancer', 'as', 'well')],
                     [('cancer', 'as'), ('cancer', 'as', 'well'), ('cancer', 'as', 'well', 'as')],
                     [('as', 'well'), ('as', 'well', 'as'), ('as', 'well', 'as', 'doctor')],
                     [('well', 'as'), ('well', 'as', 'doctor')], [('as', 'doctor')], []]
2 Answers
You don't need the majority of your code. You only need to keep ne_generator and implement a custom itertools.pairwise that works with any window size.
To make it behave the same way as your code, you need to use itertools.zip_longest with custom sentinel values, which we then remove.
This can be for example:

from itertools import tee, zip_longest

NO_FILLVALUE = object()


def nth_wise(iterable, n=2, fillvalue=NO_FILLVALUE):
    # make n copies of the iterator and advance the i-th copy by i steps,
    # so zipping them yields sliding windows of size n
    its = tee(iterable, n)
    for i, it in enumerate(its):
        for _ in range(i):
            next(it, None)
    if fillvalue is NO_FILLVALUE:
        return zip(*its)
    return zip_longest(*its, fillvalue=fillvalue)


def whatever(data, size):
    EMPTY = object()
    for items in nth_wise(data, n=size, fillvalue=EMPTY):
        yield tuple(item for item in items if item is not EMPTY)
This produces the items:

>>> list(whatever('abcde', 3))
[
    ('a', 'b', 'c'),
    ('b', 'c', 'd'),
    ('c', 'd', 'e'),
    ('d', 'e'),
    ('e',)
]
From this you can then add your ne_generator code, which can be drastically simplified if you ignore removing duplicates.
Allowing for:
def whatever(data, size):
    EMPTY = object()
    for items in nth_wise(data, n=size, fillvalue=EMPTY):
        items = (item for item in items if item is not EMPTY)
        # grow the candidate one item at a time, yielding every prefix
        result = (next(items),)
        for item in items:
            result += (item,)
            yield result
If you want to add the indices to each piece of data, then you can just pass the data through enumerate. If you want to add your sublist stuff, then you can change the for loop to append to a list instead.
Here's an example of input and output of the above:
>>> list(whatever('abcde', 3))
[('a', 'b'), ('a', 'b', 'c'), ('b', 'c'), ('b', 'c', 'd'), ('c', 'd'), ('c', 'd', 'e'), ('d', 'e')]
>>> list(whatever(enumerate('abcde'), 3))
[((0, 'a'), (1, 'b')), ((0, 'a'), (1, 'b'), (2, 'c')), ((1, 'b'), (2, 'c')), ((1, 'b'), (2, 'c'), (3, 'd')), ((2, 'c'), (3, 'd')), ((2, 'c'), (3, 'd'), (4, 'e')), ((3, 'd'), (4, 'e'))]
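If the goal is to recover the question's separate indices/words lists from the enumerate variant, the (index, word) pairs can be unzipped afterwards (a hypothetical post-processing step, not part of the answer):

```python
# Each candidate is a tuple of (index, word) pairs, as produced by
# whatever(enumerate(...), size); split them back into indices and words.
pairs = [((0, 'a'), (1, 'b')), ((0, 'a'), (1, 'b'), (2, 'c'))]

indices = [[i for i, _ in cand] for cand in pairs]
words = [tuple(w for _, w in cand) for cand in pairs]

print(indices)  # [[0, 1], [0, 1, 2]]
print(words)    # [('a', 'b'), ('a', 'b', 'c')]
```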
Comment: Thanks for the answer. For test purposes I just put part of the code; the complete implementation can be found at [this link][1] [1] stackoverflow.com/questions/47183844/…, where candidates are generated by NE iteration and tree traversal. If this is the situation, I think we might need to change the indexing method. – Logan, Nov 15, 2017
It would seem simpler to generate the candidates by just the indices, and translate them into words at the end:

def generate_sent_position_candidates_and_indices(self, sent, ne_size):
    indices = []
    pos_cands = []
    for i in range(len(sent)):
        curnt_indices = [list(cand) for cand in
                         self.generate_candidates(range(i, len(sent)), ne_size)]
        # look up the words by index
        curnt_pos_cands = [tuple(sent[j] for j in sublist)
                           for sublist in curnt_indices]
        indices.append(curnt_indices)
        pos_cands.append(curnt_pos_cands)
    return indices, pos_cands

(This code passes the test.)
In this approach, should the candidates argument to ne_generator be non-empty, it would have to be converted from words to indices; IMO this would still be simpler overall.
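For instance, under the assumption that every seed candidate is a contiguous span of the sentence starting at position i, a hypothetical conversion helper could be as simple as:

```python
# Hypothetical helper: map a seed candidate of words back to indices,
# assuming the candidate is a contiguous span starting at position i.
def seed_to_indices(seed, i):
    return [i + offset for offset in range(len(seed))]

print(seed_to_indices(('as', 'well', 'as'), 2))  # [2, 3, 4]
```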
Comments:
– … self.generate_candidates(sent[i:], ne_size)? And an example usage of the function?
– The candidates are the results retrieved from a lexicon tree traversal; I forgot to change it. I'll update with a collected test file shortly.