6
\$\begingroup\$

I have one list of data with duplicate IDs and a second list of data with attributes for the IDs.

>>>a
[1,1,1,1,2,2,3,3]
>>>b
['No','No','Yes','No','Yes','No','No','No']

I want to create a list (c) that holds indexes for a that match the following criteria:

1) Only one index per ID in a

2) Indexes for IDs that have a 'Yes' in b take precedent

3) For IDs that don't have a 'Yes' any index is fine.

Using the above data as an example, the output would look like:

>>>c
[2,4,6]

So, you can see the data more easily:

>>> for c,(x,y) in enumerate(zip(a,b)): print(c,x,y)
...
0 1 No
1 1 No
2 1 Yes # grab this index ID 1
3 1 No
4 2 Yes # grab this index for ID 2
5 2 No 
6 3 No # grab this index, but either would have been fine for ID 3
7 3 No

Below is code that I've written. I am looking for a more efficient and Pythonic way to do it.

filterwhen='Yes' 
uni=set(a)
sortedMain=sorted([list((i,c)) for c,i in enumerate(a)]) # sort list a
order=[i.pop() for i in sortedMain]
sortedFilt=[b[i] for i in order] # sort list b to keep data matched
sortedMain=[x for i in sortedMain for x in i]
c=[]
for i in uni:
 dex=[c for c,x in enumerate(sortedMain) if x==i]
 test=list(map((lambda x: x.lower().strip()==filterwhen.lower().strip()),[sortedFilt[i] for i in dex]))
 if any(test)==True:
 selecteddex=test.index(True)
 c.append(dex[selecteddex])
 else: c.append(dex[0])
print(c)

a has over 58,000 elements and 26,000+ are unique. Due to all the iterations, it takes too long to run this.

asked Jul 25, 2016 at 20:08
\$\endgroup\$
8
  • \$\begingroup\$ Interesting question. Does it correspond to some programming challenge? What does "take precedent" mean ? \$\endgroup\$ Commented Jul 25, 2016 at 20:28
  • \$\begingroup\$ Nope, this is not part of a programming challenge. By "take precedent" I mean, if an ID has a corresponding 'Yes' in the other data list, I want that index over a 'No'. For example, ID '1' is repeated several times in a. I would like to get a returned index value of '2' for that ID since a 'Yes' was present. Does that make sense? \$\endgroup\$ Commented Jul 25, 2016 at 20:34
  • \$\begingroup\$ I've added comments to the print statement output to make it clearer the indexes I would like to select from that data. \$\endgroup\$ Commented Jul 25, 2016 at 20:43
  • \$\begingroup\$ Where do your a and b come from? I get the feeling that you are trying to do something the hard way. \$\endgroup\$ Commented Jul 25, 2016 at 21:04
  • \$\begingroup\$ a and b are columns in an excel spreadsheet that have been pulled into Python. \$\endgroup\$ Commented Jul 25, 2016 at 21:07

5 Answers 5

3
\$\begingroup\$

Consider doing this only with a single for loop, i.e. zip the two lists, sort them by IDs and Yes/No (Yes first so those end up in front in each block), enumerate and then, while keeping track of the current block ID, simply always take the first element from each block and append it to c. E.g.:

a=[1,1,1,2,2,3,3,1]
b=['No','Yes','No','Yes','No','No','No','No']
d = list(enumerate(zip(a, b)))
def compare(x, y):
 same_ids = cmp(x[0], y[0])
 return -cmp(x[1], y[1]) if same_ids == 0 else same_ids
d.sort(cmp=compare, key=lambda x: x[1])
c = []
last = None
for i, (x, y) in d:
 if last != x or last == None:
 c.append(i)
 last = x

At the price of more memory consumption you could also just do a linear scan and keep a dictionary around that records the best choice per ID so far. E.g.:

from itertools import izip
a=[1,1,1,2,2,3,3,1]
b=['No','Yes','No','Yes','No','No','No','No']
known = {}
for i, (x, y) in enumerate(izip(a, b)):
 current = known.get(x)
 if not current or y == 'Yes' and current[1] == 'No':
 known[x] = (i, y)
c = [i for (i, _) in known.itervalues()]
answered Jul 25, 2016 at 21:35
\$\endgroup\$
2
\$\begingroup\$

Credit to @ferada. I think this is basically what he/she was saying:

# zip, enumerate and sort data -> [[b[0],a[0],0],...,[b[n],a[n],n]]
# when the data are sorted by b[n] the enumerate holds their original position
filt_var_dex=sorted([list(i)+[c] for c,(i) in enumerate(zip(b,a))],reverse=True)
# pull the sorted 'a' list out -> [2, 1, 3, 3, 2, 1, 1, 1]
sortedvar=[i[1] for i in filt_var_dex]
# due to the sorting, the first occurence of an ID will be 'Yes' and 'No' if 'Yes' is unavailable.
# this will give indexes to unique values of 'a'
keepdex=[sortedvar.index(i) for i in set(sortedvar)]
# since enumerate held the original position pluck that
c=[filt_var_dex[i][-1] for i in keepdex]
answered Jul 25, 2016 at 22:13
\$\endgroup\$
1
  • \$\begingroup\$ This answer does not review the posted code, this is an alternative implementation. A good Code Review answer should explain weaknesses in the posted code, and explain the suggested improvements. \$\endgroup\$ Commented Jul 31, 2016 at 6:47
1
\$\begingroup\$

So, I wrote this, which runs much faster, but still not sure if there is a better way to do this:

# data
a=[1,1,1,1,2,2,3,3]
b=['No','No','Yes','No','Yes','No','No','No']
filterby='Yes'
# get indexes for all elements in 'b' that satisfy condition 2)
whendex=[c for c,i in enumerate(b) if i.lower().strip()==filterwhen.lower().strip()]
# get IDs from 'a' where condition 2) satisfied 
mainWhenTrue=[a[i] for i in whendex]
# get indexes for unduplicated IDs from 'a' where condition 2) is satisfied
mainDexWhenTrue=[whendex[x] for x in [mainWhenTrue.index(i) for i in set(mainWhenTrue)]]
# make a list of IDs in 'a' for which we already have indexes
alreadyhave=[a[i] for i in mainDexWhenTrue]
# get index of first occurrence of any IDs for which we don't already have
mainDexWhenFalse=[a.index(i) for i in set(a) if i not in alreadyhave]
c=mainDexWhenTrue+mainDexWhenFalse
print(c)
answered Jul 25, 2016 at 21:37
\$\endgroup\$
1
\$\begingroup\$

Cannot test how fast this is as I'm away from my pc, but this is one way to do it:

indexs = {}
for index, value in enumerate(a):
 if value not in indexs:
 indexs[value] = index
 elif b[index] == "yes":
 indexs[value] = index
c = list(indexs.values())

It will always return the index of the value found in the list if that value has no yes's attributed to it.

answered Jul 25, 2016 at 21:52
\$\endgroup\$
1
  • \$\begingroup\$ Just realised you can condense that down to one if with an or, whoops. \$\endgroup\$ Commented Jul 25, 2016 at 21:54
1
\$\begingroup\$

With Cython the task with 10000 elements is calculated in 84 us. The same Python code needs 4.12 ms, but for Python it is generally not optimal to loop. From the other codes the fastest was Billyoyo with 2.37 ms. I did a conversion to bool outside my functions but on practice it was extremely fast in numpy array.

In [22]: %timeit res=so9_1.calc_py(a,b_bool,np.max(a))
100 loops, best of 3: 4.12 ms per loop
In [23]: %timeit res=so9_2.calc_cy(a,b_bool.view(dtype=np.int8),np.max(a))
10000 loops, best of 3: 84.6 us per loop
In [24]: %timeit res=so9_LMc.calc_LMc(a,b)
1 loop, best of 3: 358 ms per loop
In [25]: %timeit res=so9_LMcOrig.calc_LMcOrig(a,b)
1 loop, best of 3: 3.66 s per loop
In [26]: %timeit res=so9_ferada2.calc_ferada2(a,b)
100 loops, best of 3: 4.26 ms per loop
In [27]: %timeit res=so9_Billy.calc_Billy(a,b)
100 loops, best of 3: 2.37 ms per loop

The output differs a little between all implementations because you said that if the conditions are met then it does not matter which number is taken.

Code that tests my variants on 1M records. As other codes could not work with that much I later changed the code to 10000 records:

In [1]: import numpy as np
In [2]: import so9_1
In [3]: import so9_2
In [4]: np.random.seed(1)
In [5]: a=np.arange(1,500001)
In [6]: a=np.concatenate((a,np.random.randint(1,500001, size=500000)))
In [7]: a=np.sort(a)
In [8]: b=np.random.rand(np.alen(a))<0.4
In [9]: b_temp=np.empty(np.shape(a),dtype=object)
In [10]: b_temp[b==True]="Yes"
In [11]: b_temp[b==False]="No"
In [12]: b=b_temp
In [13]: #initial state
In [14]: a[:21]
Out[14]:
array([ 1, 1, 1, 2, 3, 4, 5, 6, 6, 6, 7, 7, 8, 8, 9, 10, 11,
 11, 11, 12, 13])
In [15]: b[:21]
Out[15]:
array(['No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes',
 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'No',
 'Yes'], dtype=object)
In [16]: #convert second array to True and False
In [17]: b_bool=np.where(b=="Yes",True,False)
In [18]: %timeit res=so9_1.calc_py(a,b_bool,np.max(a))
1 loop, best of 3: 416 ms per loop
In [19]: %timeit res=so9_2.calc_cy(a,b_bool.view(dtype=np.int8),np.max(a))
100 loops, best of 3: 9.4 ms per loop
In [20]: so9_1.calc_py(a,b_bool,np.max(a))[:11]
Out[20]: array([ 0, 3, 4, 5, 6, 9, 11, 13, 14, 15, 18])
In [21]: so9_2.calc_cy(a,b_bool.view(dtype=np.int8),np.max(a))[:11]
Out[21]: array([ 0, 3, 4, 5, 6, 9, 11, 13, 14, 15, 18])

python so9_1.py:

import numpy as np
def calc_py(a,b,m):
 res=np.empty(m,dtype=int)
 last_a=0
 for i in range(np.alen(a)):
 a_i=a[i]
 b_i=b[i]
 if a_i>last_a or b_i:
 res[a_i-1]=i
 last_a=a_i
 return res

cython so9_2.pyx file:

#cython: boundscheck=False, wraparound=False, nonecheck=False
import cython
import numpy as np
cimport numpy as np
from numpy cimport ndarray
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.infer_types(True)
def calc_cy(ndarray[np.int_t, ndim=1] a,ndarray[np.int8_t, ndim=1] b,int m):
 cdef ndarray[np.int_t, ndim=1] res=np.empty(m,dtype=int)
 cdef int last_a=0
 cdef int i
 cdef int a_i
 cdef np.int8_t b_i
 for i in range(np.alen(a)):
 a_i=a[i]
 b_i=b[i]
 if a_i>last_a or b_i==1:
 res[a_i-1]=i
 last_a=a_i
 return res

setup_so9_2.py:

from distutils.core import setup
from Cython.Build import cythonize
import numpy as np
setup(
 name = 'App',
 ext_modules = cythonize("so9_2.pyx"),
 include_dirs=[np.get_include()],
)

Command line to compile:

python setup_so9_2.py build_ext --inplace

You should have so9_2.pyd file in the directory for the code to run

answered Jul 28, 2016 at 23:06
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.