I have one list of data with duplicate IDs and a second list of data with attributes for the IDs.
>>> a
[1,1,1,1,2,2,3,3]
>>> b
['No','No','Yes','No','Yes','No','No','No']
I want to create a list (`c`) that holds indexes for `a` that match the following criteria:
1) Only one index per ID in `a`
2) Indexes for IDs that have a 'Yes' in `b` take precedence
3) For IDs that don't have a 'Yes', any index is fine.
Using the above data as an example, the output would look like:
>>> c
[2,4,6]
So that you can see the data more easily:
>>> for c,(x,y) in enumerate(zip(a,b)): print(c,x,y)
...
0 1 No
1 1 No
2 1 Yes # grab this index for ID 1
3 1 No
4 2 Yes # grab this index for ID 2
5 2 No
6 3 No # grab this index, but either would have been fine for ID 3
7 3 No
Below is code that I've written. I am looking for a more efficient and Pythonic way to do it.
filterwhen='Yes'
uni=set(a)
sortedMain=sorted([list((i,c)) for c,i in enumerate(a)]) # sort list a
order=[i.pop() for i in sortedMain]
sortedFilt=[b[i] for i in order] # sort list b to keep data matched
sortedMain=[x for i in sortedMain for x in i]
c=[]
for i in uni:
    dex=[c for c,x in enumerate(sortedMain) if x==i]
    test=list(map((lambda x: x.lower().strip()==filterwhen.lower().strip()),[sortedFilt[i] for i in dex]))
    if any(test)==True:
        selecteddex=test.index(True)
        c.append(dex[selecteddex])
    else:
        c.append(dex[0])
print(c)
`a` has over 58,000 elements and 26,000+ are unique. Due to all the iterations, it takes too long to run this.
5 Answers
Consider doing this with only a single `for` loop, i.e. `zip` the two lists, sort them by ID and `Yes`/`No` (`Yes` first so those end up in front in each block), `enumerate`, and then, while keeping track of the current block ID, simply always take the first element from each block and append it to `c`. E.g.:
a=[1,1,1,2,2,3,3,1]
b=['No','Yes','No','Yes','No','No','No','No']

d = list(enumerate(zip(a, b)))

def compare(x, y):
    same_ids = cmp(x[0], y[0])
    return -cmp(x[1], y[1]) if same_ids == 0 else same_ids

d.sort(cmp=compare, key=lambda x: x[1])

c = []
last = None
for i, (x, y) in d:
    if last != x or last == None:
        c.append(i)
        last = x
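Note that this is Python 2 code; `cmp` and the `cmp=` argument to `sort` are gone in Python 3. A minimal Python 3 sketch of the same idea, replacing the comparator with a tuple key so that 'Yes' sorts first within each ID block:

a = [1, 1, 1, 2, 2, 3, 3, 1]
b = ['No', 'Yes', 'No', 'Yes', 'No', 'No', 'No', 'No']

d = list(enumerate(zip(a, b)))
# Sort by ID ascending, then 'Yes' before 'No' inside each ID block.
d.sort(key=lambda t: (t[1][0], t[1][1] != 'Yes'))

c = []
last = None
for i, (x, y) in d:
    if last != x:  # first element of each ID block
        c.append(i)
        last = x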
At the price of more memory consumption you could also just do a linear scan and keep a dictionary around that records the best choice per ID so far. E.g.:
from itertools import izip

a=[1,1,1,2,2,3,3,1]
b=['No','Yes','No','Yes','No','No','No','No']

known = {}
for i, (x, y) in enumerate(izip(a, b)):
    current = known.get(x)
    if not current or y == 'Yes' and current[1] == 'No':
        known[x] = (i, y)

c = [i for (i, _) in known.itervalues()]
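Again Python 2; in Python 3 the same scan needs only the built-in `zip` and `values()` (a minimal sketch, using `a` and `b` from above):

known = {}
for i, (x, y) in enumerate(zip(a, b)):
    current = known.get(x)
    if not current or (y == 'Yes' and current[1] == 'No'):
        known[x] = (i, y)

c = [i for (i, _) in known.values()]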
Credit to @ferada. I think this is basically what he/she was saying:
# zip, enumerate and sort data -> [[b[0],a[0],0],...,[b[n],a[n],n]]
# when the data are sorted by b[n] the enumerate holds their original position
filt_var_dex=sorted([list(i)+[c] for c,(i) in enumerate(zip(b,a))],reverse=True)
# pull the sorted 'a' list out -> [2, 1, 3, 3, 2, 1, 1, 1]
sortedvar=[i[1] for i in filt_var_dex]
# due to the sorting, the first occurrence of an ID will be 'Yes', or 'No' if 'Yes' is unavailable.
# this will give indexes to unique values of 'a'
keepdex=[sortedvar.index(i) for i in set(sortedvar)]
# since enumerate held the original position pluck that
c=[filt_var_dex[i][-1] for i in keepdex]
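The same sort-then-take-first idea can also be expressed with `itertools.groupby`; this is a sketch I am adding for illustration (not part of the original answer), using the same `a` and `b`:

from itertools import groupby

# Pair each row with its original index, sort by (ID, 'Yes' first),
# then keep the first row of each ID group.
rows = sorted(enumerate(zip(a, b)), key=lambda t: (t[1][0], t[1][1] != 'Yes'))
c = [next(group)[0] for _, group in groupby(rows, key=lambda t: t[1][0])]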
\$\begingroup\$ This answer does not review the posted code, this is an alternative implementation. A good Code Review answer should explain weaknesses in the posted code, and explain the suggested improvements. \$\endgroup\$ – janos, Jul 31, 2016 at 6:47
So, I wrote this, which runs much faster, but I'm still not sure if there is a better way to do it:
# data
a=[1,1,1,1,2,2,3,3]
b=['No','No','Yes','No','Yes','No','No','No']
filterwhen='Yes'
# get indexes for all elements in 'b' that satisfy condition 2)
whendex=[c for c,i in enumerate(b) if i.lower().strip()==filterwhen.lower().strip()]
# get IDs from 'a' where condition 2) satisfied
mainWhenTrue=[a[i] for i in whendex]
# get indexes for unduplicated IDs from 'a' where condition 2) is satisfied
mainDexWhenTrue=[whendex[x] for x in [mainWhenTrue.index(i) for i in set(mainWhenTrue)]]
# make a list of IDs in 'a' for which we already have indexes
alreadyhave=[a[i] for i in mainDexWhenTrue]
# get index of first occurrence of any ID for which we don't already have an index
mainDexWhenFalse=[a.index(i) for i in set(a) if i not in alreadyhave]
c=mainDexWhenTrue+mainDexWhenFalse
print(c)
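For comparison, the same two-pass idea fits in a few lines with dictionaries; this is a sketch under the same data, and names like `yes_dex` are my own:

# last 'Yes' index per ID (any 'Yes' index satisfies condition 2)
yes_dex = {a[i]: i for i, v in enumerate(b)
           if v.lower().strip() == filterwhen.lower().strip()}

# first index per ID, the fallback for condition 3
first_dex = {}
for i, ident in enumerate(a):
    first_dex.setdefault(ident, i)

# 'Yes' indexes take precedence over plain first occurrences
c = list({**first_dex, **yes_dex}.values())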
Cannot test how fast this is as I'm away from my PC, but this is one way to do it:
indexs = {}
for index, value in enumerate(a):
    if value not in indexs:
        indexs[value] = index
    elif b[index] == "Yes":  # note: must match the capitalised "Yes" used in b
        indexs[value] = index

c = list(indexs.values())
If an ID has no 'Yes' attributed to it, this returns the index of its first occurrence; if it does have one, the index of the last 'Yes' wins.
\$\begingroup\$ Just realised you can condense that down to one if with an or, whoops. \$\endgroup\$ – Billyoyo, Jul 25, 2016 at 21:54
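The condensed version that comment refers to would look something like this (my reading of it, not Billyoyo's exact code):

indexs = {}
for index, value in enumerate(a):
    if value not in indexs or b[index] == "Yes":
        indexs[value] = index

c = list(indexs.values())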
With Cython, the task with 10,000 elements is computed in 84 µs. The same Python code needs 4.12 ms, but looping element by element is generally not optimal in pure Python. Of the other solutions, the fastest was Billyoyo's at 2.37 ms. I did the conversion to bool outside my functions, but in practice it was extremely fast on a NumPy array.
In [22]: %timeit res=so9_1.calc_py(a,b_bool,np.max(a))
100 loops, best of 3: 4.12 ms per loop
In [23]: %timeit res=so9_2.calc_cy(a,b_bool.view(dtype=np.int8),np.max(a))
10000 loops, best of 3: 84.6 us per loop
In [24]: %timeit res=so9_LMc.calc_LMc(a,b)
1 loop, best of 3: 358 ms per loop
In [25]: %timeit res=so9_LMcOrig.calc_LMcOrig(a,b)
1 loop, best of 3: 3.66 s per loop
In [26]: %timeit res=so9_ferada2.calc_ferada2(a,b)
100 loops, best of 3: 4.26 ms per loop
In [27]: %timeit res=so9_Billy.calc_Billy(a,b)
100 loops, best of 3: 2.37 ms per loop
The output differs a little between the implementations because, as you said, when an ID has no 'Yes' it does not matter which index is taken.
Below is the code that tests my variants on 1M records. As the other solutions could not handle that many, I later changed it to 10,000 records:
In [1]: import numpy as np
In [2]: import so9_1
In [3]: import so9_2
In [4]: np.random.seed(1)
In [5]: a=np.arange(1,500001)
In [6]: a=np.concatenate((a,np.random.randint(1,500001, size=500000)))
In [7]: a=np.sort(a)
In [8]: b=np.random.rand(np.alen(a))<0.4
In [9]: b_temp=np.empty(np.shape(a),dtype=object)
In [10]: b_temp[b==True]="Yes"
In [11]: b_temp[b==False]="No"
In [12]: b=b_temp
In [13]: #initial state
In [14]: a[:21]
Out[14]:
array([ 1, 1, 1, 2, 3, 4, 5, 6, 6, 6, 7, 7, 8, 8, 9, 10, 11,
11, 11, 12, 13])
In [15]: b[:21]
Out[15]:
array(['No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes',
'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'No',
'Yes'], dtype=object)
In [16]: #convert second array to True and False
In [17]: b_bool=np.where(b=="Yes",True,False)
In [18]: %timeit res=so9_1.calc_py(a,b_bool,np.max(a))
1 loop, best of 3: 416 ms per loop
In [19]: %timeit res=so9_2.calc_cy(a,b_bool.view(dtype=np.int8),np.max(a))
100 loops, best of 3: 9.4 ms per loop
In [20]: so9_1.calc_py(a,b_bool,np.max(a))[:11]
Out[20]: array([ 0, 3, 4, 5, 6, 9, 11, 13, 14, 15, 18])
In [21]: so9_2.calc_cy(a,b_bool.view(dtype=np.int8),np.max(a))[:11]
Out[21]: array([ 0, 3, 4, 5, 6, 9, 11, 13, 14, 15, 18])
Python file so9_1.py:
import numpy as np

def calc_py(a,b,m):
    # assumes a is sorted ascending and IDs run from 1 to m, as in the benchmark data
    res=np.empty(m,dtype=int)
    last_a=0
    for i in range(np.alen(a)):
        a_i=a[i]
        b_i=b[i]
        if a_i>last_a or b_i:  # new ID block, or a 'Yes' row overriding the stored index
            res[a_i-1]=i
            last_a=a_i
    return res
Cython file so9_2.pyx:
#cython: boundscheck=False, wraparound=False, nonecheck=False
import cython
import numpy as np
cimport numpy as np
from numpy cimport ndarray

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.infer_types(True)
def calc_cy(ndarray[np.int_t, ndim=1] a, ndarray[np.int8_t, ndim=1] b, int m):
    cdef ndarray[np.int_t, ndim=1] res=np.empty(m,dtype=int)
    cdef int last_a=0
    cdef int i
    cdef int a_i
    cdef np.int8_t b_i
    for i in range(np.alen(a)):
        a_i=a[i]
        b_i=b[i]
        if a_i>last_a or b_i==1:
            res[a_i-1]=i
            last_a=a_i
    return res
setup_so9_2.py:
from distutils.core import setup
from Cython.Build import cythonize
import numpy as np

setup(
    name = 'App',
    ext_modules = cythonize("so9_2.pyx"),
    include_dirs=[np.get_include()],
)
Command line to compile:
python setup_so9_2.py build_ext --inplace
You should then have a so9_2.pyd file in the directory for the code to run.
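For reference, the same one-index-per-ID result can also be computed without an explicit loop, using only NumPy. This is a vectorized sketch of my own (not part of the benchmarks above); because the sort is stable it keeps the first 'Yes' row per ID, so it may pick a different 'Yes' index than calc_py when an ID has several:

import numpy as np

def calc_np(a, b_bool):
    # sort by ID, with 'Yes' (True) rows first inside each ID block
    order = np.lexsort((~b_bool, a))
    # mark the first row of each ID block in the sorted order
    first = np.ones(len(a), dtype=bool)
    first[1:] = a[order][1:] != a[order][:-1]
    return order[first]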
\$\begingroup\$ … I would like to get a returned index value of '2' for that ID since a 'Yes' was present. Does that make sense? \$\endgroup\$
\$\begingroup\$ Where do `a` and `b` come from? I get the feeling that you are trying to do something the hard way. \$\endgroup\$
\$\begingroup\$ `a` and `b` are columns in an Excel spreadsheet that have been pulled into Python. \$\endgroup\$