I have one list of data with duplicate IDs and a second list of data with attributes for the IDs.
>>> a
[1,1,1,1,2,2,3,3]
>>> b
['No','No','Yes','No','Yes','No','No','No']
I want to create a list (`c`) that holds indexes for `a` that match the following criteria:
1) Only one index per ID in `a`
2) Indexes for IDs that have a 'Yes' in `b` take precedence
3) For IDs that don't have a 'Yes', any index is fine.
Using the above data as an example, the output would look like:
>>> c
[2,4,6]
So that you can see the data more easily:
>>> for c,(x,y) in enumerate(zip(a,b)): print(c,x,y)
...
0 1 No
1 1 No
2 1 Yes # grab this index for ID 1
3 1 No
4 2 Yes # grab this index for ID 2
5 2 No
6 3 No # grab this index, but either would have been fine for ID 3
7 3 No
Below is code that I've written. I am looking for a more efficient and Pythonic way to do it.
filterwhen='Yes'
uni=set(a)
sortedMain=sorted([list((i,c)) for c,i in enumerate(a)]) # sort list a
order=[i.pop() for i in sortedMain]
sortedFilt=[b[i] for i in order] # sort list b to keep data matched
sortedMain=[x for i in sortedMain for x in i]
c=[]
for i in uni:
    dex=[c for c,x in enumerate(sortedMain) if x==i]
    test=list(map((lambda x: x.lower().strip()==filterwhen.lower().strip()),[sortedFilt[i] for i in dex]))
    if any(test)==True:
        selecteddex=test.index(True)
        c.append(dex[selecteddex])
    else:
        c.append(dex[0])
print(c)
`a` has over 58,000 elements and 26,000+ are unique. Due to all the iterations, it takes too long to run this.
5 Answers
Consider doing this with only a single `for` loop, i.e. `zip` the two lists, sort them by ID and `Yes`/`No` (`Yes` first so those end up in front in each block), `enumerate`, and then, while keeping track of the current block ID, simply always take the first element from each block and append it to `c`. E.g.:
a=[1,1,1,2,2,3,3,1]
b=['No','Yes','No','Yes','No','No','No','No']

d = list(enumerate(zip(a, b)))

def compare(x, y):
    same_ids = cmp(x[0], y[0])
    return -cmp(x[1], y[1]) if same_ids == 0 else same_ids

d.sort(cmp=compare, key=lambda x: x[1])

c = []
last = None
for i, (x, y) in d:
    if last != x or last == None:
        c.append(i)
        last = x
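Note that this is Python 2 code; `cmp` and the `cmp=` argument to `sort` are gone in Python 3. A minimal Python 3 sketch of the same idea, replacing the comparator with a tuple key so that 'Yes' sorts first within each ID block:

a = [1, 1, 1, 2, 2, 3, 3, 1]
b = ['No', 'Yes', 'No', 'Yes', 'No', 'No', 'No', 'No']

d = list(enumerate(zip(a, b)))
# Sort by ID ascending, then 'Yes' before 'No' inside each ID block.
d.sort(key=lambda t: (t[1][0], t[1][1] != 'Yes'))

c = []
last = None
for i, (x, y) in d:
    if last != x:  # first element of each ID block
        c.append(i)
        last = x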
At the price of more memory consumption you could also just do a linear scan and keep a dictionary around that records the best choice per ID so far. E.g.:
from itertools import izip

a=[1,1,1,2,2,3,3,1]
b=['No','Yes','No','Yes','No','No','No','No']

known = {}
for i, (x, y) in enumerate(izip(a, b)):
    current = known.get(x)
    if not current or y == 'Yes' and current[1] == 'No':
        known[x] = (i, y)

c = [i for (i, _) in known.itervalues()]
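Again Python 2; in Python 3 the same scan needs only the built-in `zip` and `values()` (a minimal sketch, using `a` and `b` from above):

known = {}
for i, (x, y) in enumerate(zip(a, b)):
    current = known.get(x)
    if not current or (y == 'Yes' and current[1] == 'No'):
        known[x] = (i, y)

c = [i for (i, _) in known.values()]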
Credit to @ferada. I think this is basically what he/she was saying:
# zip, enumerate and sort data -> [[b[0],a[0],0],...,[b[n],a[n],n]]
# when the data are sorted by b[n] the enumerate holds their original position
filt_var_dex=sorted([list(i)+[c] for c,(i) in enumerate(zip(b,a))],reverse=True)
# pull the sorted 'a' list out -> [2, 1, 3, 3, 2, 1, 1, 1]
sortedvar=[i[1] for i in filt_var_dex]
# due to the sorting, the first occurrence of an ID will be 'Yes', or 'No' if 'Yes' is unavailable.
# this will give indexes to unique values of 'a'
keepdex=[sortedvar.index(i) for i in set(sortedvar)]
# since enumerate held the original position pluck that
c=[filt_var_dex[i][-1] for i in keepdex]
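The same sort-then-take-first idea can also be expressed with `itertools.groupby`; this is a sketch I am adding for illustration (not part of the original answer), using the same `a` and `b`:

from itertools import groupby

# Pair each row with its original index, sort by (ID, 'Yes' first),
# then keep the first row of each ID group.
rows = sorted(enumerate(zip(a, b)), key=lambda t: (t[1][0], t[1][1] != 'Yes'))
c = [next(group)[0] for _, group in groupby(rows, key=lambda t: t[1][0])]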
\$\begingroup\$ This answer does not review the posted code, this is an alternative implementation. A good Code Review answer should explain weaknesses in the posted code, and explain the suggested improvements. \$\endgroup\$ – janos, Jul 31, 2016 at 6:47
So, I wrote this, which runs much faster, but I'm still not sure if there is a better way to do it:
# data
a=[1,1,1,1,2,2,3,3]
b=['No','No','Yes','No','Yes','No','No','No']
filterwhen='Yes'
# get indexes for all elements in 'b' that satisfy condition 2)
whendex=[c for c,i in enumerate(b) if i.lower().strip()==filterwhen.lower().strip()]
# get IDs from 'a' where condition 2) satisfied
mainWhenTrue=[a[i] for i in whendex]
# get indexes for unduplicated IDs from 'a' where condition 2) is satisfied
mainDexWhenTrue=[whendex[x] for x in [mainWhenTrue.index(i) for i in set(mainWhenTrue)]]
# make a list of IDs in 'a' for which we already have indexes
alreadyhave=[a[i] for i in mainDexWhenTrue]
# get index of first occurrence of any ID for which we don't already have an index
mainDexWhenFalse=[a.index(i) for i in set(a) if i not in alreadyhave]
c=mainDexWhenTrue+mainDexWhenFalse
print(c)
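For comparison, the same two-pass idea fits in a few lines with dictionaries; this is a sketch under the same data, and names like `yes_dex` are my own:

# last 'Yes' index per ID (any 'Yes' index satisfies condition 2)
yes_dex = {a[i]: i for i, v in enumerate(b)
           if v.lower().strip() == filterwhen.lower().strip()}

# first index per ID, the fallback for condition 3
first_dex = {}
for i, ident in enumerate(a):
    first_dex.setdefault(ident, i)

# 'Yes' indexes take precedence over plain first occurrences
c = list({**first_dex, **yes_dex}.values())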
Cannot test how fast this is as I'm away from my PC, but this is one way to do it:
indexs = {}
for index, value in enumerate(a):
    if value not in indexs:
        indexs[value] = index
    elif b[index] == "Yes":  # note: must match the capitalised "Yes" used in b
        indexs[value] = index

c = list(indexs.values())
If an ID has no 'Yes' attributed to it, this returns the index of its first occurrence; if it does have one, the index of the last 'Yes' wins.
\$\begingroup\$ Just realised you can condense that down to one if with an or, whoops. \$\endgroup\$ – Billyoyo, Jul 25, 2016 at 21:54
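The condensed version that comment refers to would look something like this (my reading of it, not Billyoyo's exact code):

indexs = {}
for index, value in enumerate(a):
    if value not in indexs or b[index] == "Yes":
        indexs[value] = index

c = list(indexs.values())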
With Cython, the task with 10,000 elements is computed in 84 µs. The same Python code needs 4.12 ms, but looping element by element is generally not optimal in pure Python. Of the other solutions, the fastest was Billyoyo's at 2.37 ms. I did the conversion to bool outside my functions, but in practice it was extremely fast on a NumPy array.
In [22]: %timeit res=so9_1.calc_py(a,b_bool,np.max(a))
100 loops, best of 3: 4.12 ms per loop
In [23]: %timeit res=so9_2.calc_cy(a,b_bool.view(dtype=np.int8),np.max(a))
10000 loops, best of 3: 84.6 us per loop
In [24]: %timeit res=so9_LMc.calc_LMc(a,b)
1 loop, best of 3: 358 ms per loop
In [25]: %timeit res=so9_LMcOrig.calc_LMcOrig(a,b)
1 loop, best of 3: 3.66 s per loop
In [26]: %timeit res=so9_ferada2.calc_ferada2(a,b)
100 loops, best of 3: 4.26 ms per loop
In [27]: %timeit res=so9_Billy.calc_Billy(a,b)
100 loops, best of 3: 2.37 ms per loop
The output differs a little between the implementations because, as you said, when an ID has no 'Yes' it does not matter which index is taken.
Below is the code that tests my variants on 1M records. As the other solutions could not handle that many, I later changed it to 10,000 records:
In [1]: import numpy as np
In [2]: import so9_1
In [3]: import so9_2
In [4]: np.random.seed(1)
In [5]: a=np.arange(1,500001)
In [6]: a=np.concatenate((a,np.random.randint(1,500001, size=500000)))
In [7]: a=np.sort(a)
In [8]: b=np.random.rand(np.alen(a))<0.4
In [9]: b_temp=np.empty(np.shape(a),dtype=object)
In [10]: b_temp[b==True]="Yes"
In [11]: b_temp[b==False]="No"
In [12]: b=b_temp
In [13]: #initial state
In [14]: a[:21]
Out[14]:
array([ 1, 1, 1, 2, 3, 4, 5, 6, 6, 6, 7, 7, 8, 8, 9, 10, 11,
11, 11, 12, 13])
In [15]: b[:21]
Out[15]:
array(['No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes',
'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'No',
'Yes'], dtype=object)
In [16]: #convert second array to True and False
In [17]: b_bool=np.where(b=="Yes",True,False)
In [18]: %timeit res=so9_1.calc_py(a,b_bool,np.max(a))
1 loop, best of 3: 416 ms per loop
In [19]: %timeit res=so9_2.calc_cy(a,b_bool.view(dtype=np.int8),np.max(a))
100 loops, best of 3: 9.4 ms per loop
In [20]: so9_1.calc_py(a,b_bool,np.max(a))[:11]
Out[20]: array([ 0, 3, 4, 5, 6, 9, 11, 13, 14, 15, 18])
In [21]: so9_2.calc_cy(a,b_bool.view(dtype=np.int8),np.max(a))[:11]
Out[21]: array([ 0, 3, 4, 5, 6, 9, 11, 13, 14, 15, 18])
Python file so9_1.py:
import numpy as np

def calc_py(a,b,m):
    # assumes a is sorted ascending and IDs run from 1 to m, as in the benchmark data
    res=np.empty(m,dtype=int)
    last_a=0
    for i in range(np.alen(a)):
        a_i=a[i]
        b_i=b[i]
        if a_i>last_a or b_i:  # new ID block, or a 'Yes' row overriding the stored index
            res[a_i-1]=i
            last_a=a_i
    return res
Cython file so9_2.pyx:
#cython: boundscheck=False, wraparound=False, nonecheck=False
import cython
import numpy as np
cimport numpy as np
from numpy cimport ndarray

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.infer_types(True)
def calc_cy(ndarray[np.int_t, ndim=1] a, ndarray[np.int8_t, ndim=1] b, int m):
    cdef ndarray[np.int_t, ndim=1] res=np.empty(m,dtype=int)
    cdef int last_a=0
    cdef int i
    cdef int a_i
    cdef np.int8_t b_i
    for i in range(np.alen(a)):
        a_i=a[i]
        b_i=b[i]
        if a_i>last_a or b_i==1:
            res[a_i-1]=i
            last_a=a_i
    return res
setup_so9_2.py:
from distutils.core import setup
from Cython.Build import cythonize
import numpy as np

setup(
    name = 'App',
    ext_modules = cythonize("so9_2.pyx"),
    include_dirs=[np.get_include()],
)
Command line to compile:
python setup_so9_2.py build_ext --inplace
You should then have a so9_2.pyd file in the directory for the code to run.
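For reference, the same one-index-per-ID result can also be computed without an explicit loop, using only NumPy. This is a vectorized sketch of my own (not part of the benchmarks above); because the sort is stable it keeps the first 'Yes' row per ID, so it may pick a different 'Yes' index than calc_py when an ID has several:

import numpy as np

def calc_np(a, b_bool):
    # sort by ID, with 'Yes' (True) rows first inside each ID block
    order = np.lexsort((~b_bool, a))
    # mark the first row of each ID block in the sorted order
    first = np.ones(len(a), dtype=bool)
    first[1:] = a[order][1:] != a[order][:-1]
    return order[first]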
\$\begingroup\$ … I would like to get a returned index value of '2' for that ID since a 'Yes' was present. Does that make sense? \$\endgroup\$
\$\begingroup\$ Where do `a` and `b` come from? I get the feeling that you are trying to do something the hard way. \$\endgroup\$
\$\begingroup\$ `a` and `b` are columns in an Excel spreadsheet that have been pulled into Python. \$\endgroup\$