I have two Python lists, label and presence. I want to cross-tabulate them and get the count for each of the four blocks, named A, B, C, and D in the code below.
- Both lists contain only the values True and False.
- I have tried Pandas' crosstab function. However, it is slower than my code below.
- One problem with my code is that it is not vectorized: it uses a for loop, which slows things down.
Could the Python function below be made any faster?
def cross_tab(label,presence):
    A_token=0
    B_token=0
    C_token=0
    D_token=0
    for i,j in zip(list(label),list(presence)):
        if i==True and j==True:
            A_token+=1
        elif i==False and j==False:
            D_token+=1
        elif i==True and j==False:
            C_token+=1
        elif i==False and j==True:
            B_token+=1
    return A_token,B_token,C_token,D_token
Some sample data and example input and output.
##input
label=[True,True,False,False,False,False,True,False,False,True,True,True,True,False]
presence=[True,False,False,True,False,False,True,True,False,True,False,True,False,False]
##processing
A,B,C,D=cross_tab(label,presence)
print('A:',A,'B:',B,'C:',C,'D:',D)
##Output
A: 4 B: 2 C: 3 D: 5
Edit: The answer provided by Maarten Fabré below works perfectly. For anyone who stumbles on this in the future, the logic is as follows.
Goal: find a way to vectorize the computation. The solution steps are:
- Analyze the branches and assign a unique value to each outcome. This lets the logical result be stored in a single array.
- Multiply one array by 2 and add the other array to it. The result is a single array holding a unique code for each combination.
- Count the occurrences of each unique code in that array and read off the values.
- Since these calculations can be done on arrays without a loop, convert the lists to NumPy arrays to allow a vectorized implementation.
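The steps above can be sketched roughly as follows. This is only an illustration using the sample data from the question; counting with np.unique is one possible choice for the third step, and the names codes and tally are made up for this sketch.

```python
import numpy as np

# Last step in the list: hold the sample data as boolean NumPy arrays.
label = np.array([True, True, False, False, False, False, True,
                  False, False, True, True, True, True, False])
presence = np.array([True, False, False, True, False, False, True,
                     True, False, True, False, True, False, False])

# First two steps: encode each (label, presence) pair as one integer.
# 3 = (True, True), 2 = (True, False), 1 = (False, True), 0 = (False, False)
codes = label.astype(int) * 2 + presence.astype(int)

# Third step: count how often each code occurs.
values, counts = np.unique(codes, return_counts=True)
tally = dict(zip(values.tolist(), counts.tolist()))

# np.unique only reports codes that actually occur, so default to 0.
A, B, C, D = (tally.get(code, 0) for code in (3, 1, 2, 0))
print('A:', A, 'B:', B, 'C:', C, 'D:', D)
```

For the sample data this prints the same counts as the loop version: A: 4 B: 2 C: 3 D: 5.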
- Please do not update the code in your question to incorporate feedback from answers; doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers. (Mast ♦, Aug 6, 2020 at 10:06)
- I haven't copied any code from the answer. You can check the question edit history and the answer for that. I did not say anywhere in the question that I am using a numpy array, but the answerer suggested using numpy for performance. Let me know if I am missing anything. (StatguyUser, Aug 6, 2020 at 10:11)
- It wasn't an attack, it was a pre-emptive comment. Please don't take it personally. (Mast ♦, Aug 6, 2020 at 11:04)
1 Answer
Algorithm
If you look at your code and follow the if-elif part, you see that there are 4 combinations of i and j:
    i      j     : result
    True,  True  : A
    False, True  : B
    True,  False : C
    False, False : D
If you use the tuple (i, j) as key, you can use a dict lookup:
{
(True, True): "A",
(False, True): "B",
(True, False): "C",
(False, False): "D",
}
Or simpler:
{
(True, True): 3,
(False, True): 1,
(True, False): 2,
(False, False): 0,
}
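As a rough sketch of how such a lookup could drive the counting, the snippet below feeds the looked-up block names into collections.Counter. The names BLOCK_OF and cross_tab_lookup are hypothetical and chosen only for this illustration; the data is the sample from the question.

```python
from collections import Counter

# Hypothetical name for the (i, j) -> block lookup table.
BLOCK_OF = {
    (True, True): "A",
    (False, True): "B",
    (True, False): "C",
    (False, False): "D",
}


def cross_tab_lookup(labels, presences):
    """Count the four blocks by looking up each (label, presence) pair."""
    counts = Counter(BLOCK_OF[pair] for pair in zip(labels, presences))
    # A Counter returns 0 for blocks that never occur.
    return counts["A"], counts["B"], counts["C"], counts["D"]


labels = [True, True, False, False, False, False, True,
          False, False, True, True, True, True, False]
presences = [True, False, False, True, False, False, True,
             True, False, True, False, True, False, False]
print(cross_tab_lookup(labels, presences))  # (4, 2, 3, 5)
```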
The choice of numbers is deliberate: when you treat True as 1 and False as 0, the expression i * 2 + j yields exactly these values, so you can do
import collections

def crosstab2(label, presence):
    # Encode each pair as i * 2 + j: 3 = A, 1 = B, 2 = C, 0 = D.
    for i, j in zip(label, presence):
        yield i * 2 + j

c = collections.Counter(crosstab2(label, presence))
print('A:',c[3],'B:',c[1],'C:',c[2],'D:',c[0])
This is not faster than your original solution, but it is something you can vectorize:
import collections
import numpy as np

label = np.array([True, True, False, False, False, False, True, False, False, True, True, True, True, False])
presence = np.array([True, False, False, True, False, False, True, True, False, True, False, True, False, False])
c = collections.Counter(label * 2 + presence)
print('A:',c[3],'B:',c[1],'C:',c[2],'D:',c[0])
This is significantly faster, even if you account for the overhead of creating the numpy arrays.
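A possible variation, not part of the original answer, is to let numpy do the counting as well, using np.bincount on the encoded array instead of collections.Counter. The sketch below assumes the inputs contain only booleans; the function name cross_tab_bincount is made up for this example, and no timings were measured for it here.

```python
import numpy as np


def cross_tab_bincount(labels, presences):
    """Return the counts (A, B, C, D) for two boolean sequences."""
    labels = np.asarray(labels, dtype=bool)
    presences = np.asarray(presences, dtype=bool)
    # Same encoding as above: 3 = A, 1 = B, 2 = C, 0 = D.
    codes = labels.astype(np.intp) * 2 + presences
    # minlength=4 keeps a slot for every code, even ones that never occur.
    counts = np.bincount(codes, minlength=4)
    d, b, c, a = (int(n) for n in counts)
    return a, b, c, d


label = [True, True, False, False, False, False, True,
         False, False, True, True, True, True, False]
presence = [True, False, False, True, False, False, True,
            True, False, True, False, True, False, False]
print(cross_tab_bincount(label, presence))  # (4, 2, 3, 5)
```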
Formatting
Try to follow pep8.
- spaces around operators (=, +, ...)
- spaces after a ,
naming
I try to give collections of elements a plural name. In this case, I would use labels, so if you ever need to iterate over them, you can do for label in labels, which is a lot clearer than for i in label:
list
The extra call to list in zip(list(label),list(presence)) is not necessary. zip takes any iterable, and doesn't modify it in place.
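Putting the formatting, naming, and list remarks together, the original loop might look something like the sketch below. The counter names are illustrative, and the truthiness tests (if label and presence) assume the inputs really are booleans; for other values they would not behave exactly like the original == True comparisons.

```python
def cross_tab(labels, presences):
    """Return the counts (A, B, C, D) for two sequences of booleans."""
    a_count = b_count = c_count = d_count = 0
    # zip accepts any iterable, so no list() calls are needed.
    for label, presence in zip(labels, presences):
        if label and presence:
            a_count += 1
        elif label:  # label is True, presence is False
            c_count += 1
        elif presence:  # label is False, presence is True
            b_count += 1
        else:  # both are False
            d_count += 1
    return a_count, b_count, c_count, d_count


labels = [True, True, False, False, False, False, True,
          False, False, True, True, True, True, False]
presences = [True, False, False, True, False, False, True,
             True, False, True, False, True, False, False]
print(cross_tab(labels, presences))  # (4, 2, 3, 5)
```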
- Thank you for your help! It is indeed a faster solution. One thing though: I am trying to understand your solution from a data structures and algorithms perspective. Can you explain how you broke the task down and devised the solution in that regard? I would really appreciate any guidance that can help me develop that thinking. (StatguyUser, Aug 4, 2020 at 11:40)
- Is this explanation clearer? (Maarten Fabré, Aug 4, 2020 at 11:56)
- I summarize the logic below. Let me know if it's correct, and I will add it to the question as an edit. Goal: find a way to vectorize. Step 1: Analyze and find a unique value for each evaluation. This helps save the logical output in a single array. Step 2: By multiplying one array by 2 and adding the other array to it, we get a single array with a unique coded value for each combination. Step 3: Count the unique elements in the array and fetch the values. Step 4: Since the calculation can be done on arrays without a loop, convert the lists into np arrays to allow a vectorized implementation. (StatguyUser, Aug 4, 2020 at 13:04)