Need a fast implementation of a one-to-many dict mapping. The task is to convert candidates produced by a generator to their corresponding indices.
The generator produces several word candidates at each position of a given sentence by looking ahead n words and checking whether any of them appear in the lexicon.
Input:
tokens = ['by', 'and', 'large', 'multiword', 'expression', 'are', 'a', 'pain', 'in', 'the', 'neck']
Word index dictionary, which records the position of each word in the sentence:
SlicableDict([(0, 'by'), (1, 'and'), (2, 'large'), (3, 'multiword'), (4, 'expression'), (5, 'are'), (6, 'a'), (7, 'pain'), (8, 'in'), (9, 'the'), (10, 'neck')])
Output:
#candidates
[[('by', 'and'), ('by', 'and', 'large'), ('by', 'and', 'large', 'multiword')], [('and', 'large'), ('and', 'large', 'multiword'), ('and', 'large', 'multiword', 'expression')], [('large', 'multiword'), ('large', 'multiword', 'expression'), ('large', 'multiword', 'expression', 'are')], [('multiword', 'expression'), ('multiword', 'expression', 'are'), ('multiword', 'expression', 'are', 'a')], [('expression', 'are'), ('expression', 'are', 'a'), ('expression', 'are', 'a', 'pain')], [('are', 'a'), ('are', 'a', 'pain'), ('are', 'a', 'pain', 'in')], [('a', 'pain'), ('a', 'pain', 'in'), ('a', 'pain', 'in', 'the')], [('pain', 'in'), ('pain', 'in', 'the'), ('pain', 'in', 'the', 'neck')], [('in', 'the'), ('in', 'the', 'neck')], [('the', 'neck')], []]
# indices
[[[0, 1], [0, 1, 2], [0, 1, 2, 3]], [[1, 2], [1, 2, 3], [1, 2, 3, 4]], [[2, 3], [2, 3, 4], [2, 3, 4, 5]], [[3, 4], [3, 4, 5], [3, 4, 5, 6]], [[4, 5], [4, 5, 6], [4, 5, 6, 7]], [[5, 6], [5, 6, 7], [5, 6, 7, 8]], [[6, 7], [6, 7, 8], [6, 7, 8, 9]], [[7, 8], [7, 8, 9], [7, 8, 9, 10]], [[8, 9], [8, 9, 10]], [[9, 10]], []]
Difficult case:
tokens = ['identify', 'cancer', 'as', 'well', 'as', 'doctor']
More than one 'as' appears in the sentence, so the mapping is going to be one-to-many.
output:
[[[0, 1], [0, 1, 2], [0, 1, 2, 3]], [[1, 2], [1, 2, 3], [1, 2, 3, 4]], [[2, 3], [2, 3, 4], [2, 3, 4, 5]], [[3, 4], [3, 4, 5]], [[4, 5]], []]
[[('identify', 'cancer'), ('identify', 'cancer', 'as'), ('identify', 'cancer', 'as', 'well')], [('cancer', 'as'), ('cancer', 'as', 'well'), ('cancer', 'as', 'well', 'as')], [('as', 'well'), ('as', 'well', 'as'), ('as', 'well', 'as', 'doctor')], [('well', 'as'), ('well', 'as', 'doctor')], [('as', 'doctor')], []]
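The one-to-many behaviour can be illustrated with a small standalone sketch of the reversed dict (an illustration only, not part of the generator itself):

```python
from collections import defaultdict

tokens = ['identify', 'cancer', 'as', 'well', 'as', 'doctor']

# One-to-many reversed dict: each word maps to the list of all its
# positions, so the repeated 'as' keeps both of its indices.
reversed_dict = defaultdict(list)
for index, word in enumerate(tokens):
    reversed_dict[word].append(index)

print(reversed_dict['as'])      # [2, 4]
print(reversed_dict['cancer'])  # [1]
```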
My attempt

def generate_sent_position_candidates_and_indices(self, sent, ne_size):
    word2index = SlicableDict({index: word for index, word in enumerate(sent)})
    indices = []
    pos_cands = []
    for i in range(len(sent)):
        # at each position only look at n words ahead, cut the dict
        curnt_dict = word2index[i:i + self.n]
        # one-to-many reversed dict, e.g., {word: [idx1, idx2]}
        reversed_dict = defaultdict(list)
        for key, value in curnt_dict.items():
            reversed_dict[value].append(key)
        # generate candidates at current position
        curnt_pos_cands = list(self.generate_candidates(sent[i:], ne_size))
        curnt_indices = []
        if curnt_pos_cands:
            for mwe in curnt_pos_cands:
                pool = []
                tmp = []
                for word in mwe:
                    word_index = Counter(pool)[word]
                    pool.append(word)
                    tmp.append(reversed_dict[word][word_index])
                curnt_indices.append(tmp)
        indices.append(curnt_indices)
        pos_cands.append(curnt_pos_cands)
    return indices, pos_cands
I've created a SlicableDict, built a reversed_dict at every sentence position, and maintained a pool recording the words already seen. I then use Counter to locate the right index in reversed_dict. I've tested the speed, and this is 10 times slower than the version that doesn't return indices. Is there anything I can do to improve the speed?
The implementation of SlicableDict:
# ref: https://stackoverflow.com/questions/30975339/slicing-a-python-ordereddict
from collections import OrderedDict
from itertools import islice

class SlicableDict(OrderedDict):
    def __getitem__(self, k):
        if not isinstance(k, slice):
            return OrderedDict.__getitem__(self, k)
        return SlicableDict(islice(self.items(), k.start, k.stop))
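As a quick sanity check (not from the original post), slicing returns a sub-dict of the items in that positional range:

```python
from collections import OrderedDict
from itertools import islice


class SlicableDict(OrderedDict):
    def __getitem__(self, k):
        if not isinstance(k, slice):
            return OrderedDict.__getitem__(self, k)
        # slice positionally over the ordered items
        return SlicableDict(islice(self.items(), k.start, k.stop))


# keys are positions, so positional slicing matches key slicing
d = SlicableDict(enumerate('abcde'))
print(list(d[1:3].items()))  # [(1, 'b'), (2, 'c')]
```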
Runnable code for the test:

from collections import Counter, OrderedDict, defaultdict
from itertools import islice


class Test():
    n = 6

    def __init__(self):
        self.test_case()

    def ne_generator(self, tokens, candidates, ne_size=4):
        """Named entity generator that extends generated candidates.

        Basically, it generates some word permutations relating to the 1st token.
        """
        if len(tokens) != 1:
            current_ne = (tokens[0],)
            if len(tokens) < ne_size:
                ne_size = len(tokens)
            for i in range(1, ne_size):
                current_ne += (tokens[i],)
                if current_ne not in candidates:
                    candidates.append(current_ne)
        return candidates

    def generate_candidates(self, tokens, ne_size):
        # NE generator
        candidates = self.ne_generator(tokens, [], ne_size=ne_size)
        return candidates

    def generate_sent_position_candidates_and_indices(self, sent, ne_size):
        word2index = SlicableDict({index: word for index, word in enumerate(sent)})
        indices = []
        pos_cands = []
        for i in range(len(sent)):
            # at each position only look at n words ahead, cut the dict
            curnt_dict = word2index[i:i + self.n]
            # one-to-many reversed dict, e.g., {word: [idx1, idx2]}
            reversed_dict = defaultdict(list)
            for key, value in curnt_dict.items():
                reversed_dict[value].append(key)
            # generate candidates at current position
            curnt_pos_cands = list(self.generate_candidates(sent[i:], ne_size))
            curnt_indices = []
            if curnt_pos_cands:
                for mwe in curnt_pos_cands:
                    pool = []
                    tmp = []
                    for word in mwe:
                        word_index = Counter(pool)[word]
                        pool.append(word)
                        tmp.append(reversed_dict[word][word_index])
                    curnt_indices.append(tmp)
            indices.append(curnt_indices)
            pos_cands.append(curnt_pos_cands)
        return indices, pos_cands

    def test_case(self):
        tokens = ['identify', 'cancer', 'as', 'well', 'as', 'doctor']
        a, b = self.generate_sent_position_candidates_and_indices(tokens, 4)
        assert a == [[[0, 1], [0, 1, 2], [0, 1, 2, 3]],
                     [[1, 2], [1, 2, 3], [1, 2, 3, 4]],
                     [[2, 3], [2, 3, 4], [2, 3, 4, 5]],
                     [[3, 4], [3, 4, 5]], [[4, 5]], []]
        assert b == [[('identify', 'cancer'), ('identify', 'cancer', 'as'),
                      ('identify', 'cancer', 'as', 'well')],
                     [('cancer', 'as'), ('cancer', 'as', 'well'), ('cancer', 'as', 'well', 'as')],
                     [('as', 'well'), ('as', 'well', 'as'), ('as', 'well', 'as', 'doctor')],
                     [('well', 'as'), ('well', 'as', 'doctor')], [('as', 'doctor')], []]
2 Answers
You don't need the majority of your code. You only need to keep ne_generator and implement a custom itertools.pairwise that works with any window size.
To make it behave the same way as your code, you need to use itertools.zip_longest with custom sentinel values, which we then remove.
This can be for example:

from itertools import tee, zip_longest

NO_FILLVALUE = object()


def nth_wise(iterable, n=2, fillvalue=NO_FILLVALUE):
    # make n copies of the iterator and advance the i-th copy by i steps,
    # so zipping them yields sliding windows of size n
    its = tee(iterable, n)
    for i, it in enumerate(its):
        for _ in range(i):
            next(it, None)
    if fillvalue is NO_FILLVALUE:
        return zip(*its)
    return zip_longest(*its, fillvalue=fillvalue)


def whatever(data, size):
    EMPTY = object()
    for items in nth_wise(data, n=size, fillvalue=EMPTY):
        yield tuple(item for item in items if item is not EMPTY)
This produces the items:

>>> list(whatever('abcde', 3))
[
    ('a', 'b', 'c'),
    ('b', 'c', 'd'),
    ('c', 'd', 'e'),
    ('d', 'e'),
    ('e',)
]
From this you can then add your ne_generator code, which can be drastically simplified if you ignore removing duplicates.
Allowing for:
def whatever(data, size):
    EMPTY = object()
    for items in nth_wise(data, n=size, fillvalue=EMPTY):
        items = (item for item in items if item is not EMPTY)
        # grow the candidate one item at a time, yielding every prefix
        result = (next(items),)
        for item in items:
            result += (item,)
            yield result
If you want to add the indices to each piece of data, then you can just pass the data through enumerate. If you want to add your sublist stuff, then you can change the for loop to append to a list instead.
Here's an example of input and output of the above:
>>> list(whatever('abcde', 3))
[('a', 'b'), ('a', 'b', 'c'), ('b', 'c'), ('b', 'c', 'd'), ('c', 'd'), ('c', 'd', 'e'), ('d', 'e')]
>>> list(whatever(enumerate('abcde'), 3))
[((0, 'a'), (1, 'b')), ((0, 'a'), (1, 'b'), (2, 'c')), ((1, 'b'), (2, 'c')), ((1, 'b'), (2, 'c'), (3, 'd')), ((2, 'c'), (3, 'd')), ((2, 'c'), (3, 'd'), (4, 'e')), ((3, 'd'), (4, 'e'))]
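If the goal is to recover the question's separate indices/words lists from the enumerate variant, the (index, word) pairs can be unzipped afterwards (a hypothetical post-processing step, not part of the answer):

```python
# Each candidate is a tuple of (index, word) pairs, as produced by
# whatever(enumerate(...), size); split them back into indices and words.
pairs = [((0, 'a'), (1, 'b')), ((0, 'a'), (1, 'b'), (2, 'c'))]

indices = [[i for i, _ in cand] for cand in pairs]
words = [tuple(w for _, w in cand) for cand in pairs]

print(indices)  # [[0, 1], [0, 1, 2]]
print(words)    # [('a', 'b'), ('a', 'b', 'c')]
```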
Comment: Thanks for the answer. For test purposes I just put part of the code; the complete implementation can be found at [this link][1] [1] stackoverflow.com/questions/47183844/…, where candidates are generated by NE iteration and tree traversal. If this is the situation, I think we might need to change the indexing method. – Logan, Nov 15, 2017
It would seem simpler to generate the candidates by just the indices, and translate them into words at the end:

def generate_sent_position_candidates_and_indices(self, sent, ne_size):
    indices = []
    pos_cands = []
    for i in range(len(sent)):
        curnt_indices = [list(cand) for cand in
                         self.generate_candidates(range(i, len(sent)), ne_size)]
        # look up the words by index
        curnt_pos_cands = [tuple(sent[j] for j in sublist)
                           for sublist in curnt_indices]
        indices.append(curnt_indices)
        pos_cands.append(curnt_pos_cands)
    return indices, pos_cands

(This code passes the test.)
In this approach, should the candidates argument to ne_generator be non-empty, it would have to be converted from words to indices; IMO this would still be simpler overall.
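For instance, under the assumption that every seed candidate is a contiguous span of the sentence starting at position i, a hypothetical conversion helper could be as simple as:

```python
# Hypothetical helper: map a seed candidate of words back to indices,
# assuming the candidate is a contiguous span starting at position i.
def seed_to_indices(seed, i):
    return [i + offset for offset in range(len(seed))]

print(seed_to_indices(('as', 'well', 'as'), 2))  # [2, 3, 4]
```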
Comments:
– … self.generate_candidates(sent[i:], ne_size)? And an example usage of the function?
– The candidates are the results retrieved from a lexicon tree traversal; I forgot to change it. I'll update with a collected test file shortly.