Mapping elements of nested list over values in a Python dict

Question 1

Suppose there's a 3-level nested list, for example,

[[['<s>', '<s>', 'eu', 'rejects', 'german'], ['<s>', 'eu', 'rejects', 'german', 'call']],\
 [['eu', 'rejects', 'german', 'call', 'to'], ['rejects', 'german', 'call', 'to', 'boycott']]]

level 1: sentence level
level 2: context window level
level 3: word/token level

Task: I'd like to map each token to its corresponding index using a pre-defined Python dictionary (e.g., word2index['xx']=4).

Output example:

[[[2522, 2522, 475, 1620, 397], [2522, 475, 1620, 397, 439]],\
 [[475, 1620, 397, 439, 4], [1620, 397, 439, 4, 1443]]]

The following is what I've done,

def word2index(input, model):
    """ Map words to index
    """
    input2index = []
    for sent in input:
        sent2index = []
        for window in sent:
            window2index = []
            for elem in window:
                try:
                    index = model.vocab[elem].index
                    window2index.append(index)
                except KeyError:
                    unk = model.vocab['unk'].index
                    window2index.append(unk)
            sent2index.append(window2index)
        input2index.append(sent2index)
    return input2index

which involves a lot of loops and creates new lists as intermediate containers. It would be very easy to lose track of which container the current data should be appended to, especially when the nested list gets deeper.

Question: Is there a cleaner way to code such a task?

Answer

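You can use the dict.get(key, default) method to supply a default value instead of catching an exception. This turns the lookup into a single expression, which is convenient for further inlining. Note that unk below has to be the vocabulary entry model.vocab['unk'] itself rather than its index, because .index is read after the lookup: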

for elem in window:
    index = model.vocab.get(elem, unk).index
    window2index.append(index)

Which can be rewritten as:

window2index = [model.vocab.get(elem, unk).index for elem in window]

The comprehension is typically faster in CPython, and fairly straightforward.
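As a small check of that equivalence, here is a sketch using a made-up stand-in for model.vocab: a plain dict whose values carry an .index attribute. The Entry type and the index numbers are assumptions for the example, not part of the original model. It only applies if model.vocab is a real dict, or at least offers a dict-style get.

```python
from collections import namedtuple

# Hypothetical stand-in for model.vocab: a dict of objects with an .index attribute.
Entry = namedtuple("Entry", ["index"])
vocab = {"unk": Entry(0), "eu": Entry(475), "rejects": Entry(1620)}

window = ["eu", "rejects", "zzz"]  # "zzz" is not in the vocabulary

# Original style: catch KeyError for unknown words.
with_try = []
for elem in window:
    try:
        with_try.append(vocab[elem].index)
    except KeyError:
        with_try.append(vocab["unk"].index)

# dict.get style: fall back to the 'unk' entry, then read .index once.
unk = vocab["unk"]
with_get = [vocab.get(elem, unk).index for elem in window]

print(with_try)  # [475, 1620, 0]
print(with_get)  # [475, 1620, 0]
```

Both forms map the unknown word to the index of the 'unk' entry.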

Now, of course, you have a similar pattern for sent, and a similar pattern for input:

input2index = [EXPR-USING-sent for sent in input]
sent2index = [EXPR-USING-window for window in sent]
window2index = ... see above ...

So you can stack them all together as:

unk = model.vocab['unk']
input2index = [EXPR-USING-sent for sent in input]
return input2index

which becomes:

unk = model.vocab['unk']
input2index = [[EXPR-USING-window for window in sent] for sent in input]
return input2index

which becomes:

unk = model.vocab['unk']
input2index = [[[model.vocab.get(elem, unk).index
                 for elem in window]
                for window in sent]
               for sent in input]
return input2index
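Put together as a complete function, this might look like the sketch below. The model object is a hypothetical stand-in built with SimpleNamespace, the index numbers are taken from the question's example output, the index 0 for 'unk' is invented, and the parameter is named nested rather than input only to avoid shadowing the built-in.

```python
from types import SimpleNamespace


def word2index(nested, model):
    """Map each token of a sentence > window > token list to its vocabulary index."""
    unk = model.vocab["unk"]  # fallback entry for out-of-vocabulary tokens
    input2index = [[[model.vocab.get(elem, unk).index
                     for elem in window]
                    for window in sent]
                   for sent in nested]
    return input2index


# Hypothetical stand-in for the model: vocab is a dict of objects with .index.
indices = {"<s>": 2522, "eu": 475, "rejects": 1620, "german": 397,
           "call": 439, "to": 4, "boycott": 1443, "unk": 0}
model = SimpleNamespace(
    vocab={word: SimpleNamespace(index=i) for word, i in indices.items()}
)

sentences = [
    [["<s>", "<s>", "eu", "rejects", "german"],
     ["<s>", "eu", "rejects", "german", "call"]],
    [["eu", "rejects", "german", "call", "to"],
     ["rejects", "german", "call", "to", "boycott"]],
]

result = word2index(sentences, model)
print(result)
```

With this stand-in vocabulary, the result has the same three-level shape as the input, and any token missing from the vocabulary gets the index of 'unk'.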

Comment 1

Great answer. However I pattern matched your last bit of code to flatten a 3d list to a 1d list, before tripping up when I saw the three [[[ at the start of the comprehension. It may be something you want to look into, :)

Comment 2

Nice answer. One little nitpick on the final form. I would consider returning the nested list comprehension instead of assigning it to a variable simply to return it.

Comment 3

Actually, I would too. But I know there's going to be some printing of data before it gets returned, just to check that it works :->

Comment 4

Can I set more than one default value by adding an if condition? For instance, when dict[k] raises KeyError, assign a default value of <s> if k<1, else </s>.
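One possible reading of this, sketched with made-up data: the second argument of dict.get is an ordinary expression, so a conditional expression can choose the default. The dict d, its integer keys, and the helper name lookup are all hypothetical.

```python
# Hypothetical dict with integer keys; only keys 1 and 2 are present.
d = {1: "eu", 2: "rejects"}


def lookup(k):
    # The conditional expression is evaluated first, then passed to get()
    # as the value to return when k is missing.
    return d.get(k, "<s>" if k < 1 else "</s>")


print([lookup(k) for k in (0, 1, 2, 3)])  # ['<s>', 'eu', 'rejects', '</s>']
```

The default is computed on every call, even when the key is present, which is harmless for cheap expressions like this one.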

aghast (12.6k reputation, 25 silver badges, 46 bronze badges) · Accepted Answer · 2017-11-07 06:16:27Z

