Vectorized crosstabulation in Python for two arrays with two categories each

Question 1

I have two Python lists, label and presence. I want to cross-tabulate them and get the count for each of the four cells, labelled A, B, C, and D in the code below.

Both lists contain only the values True and False.
I have tried pandas' crosstab function, but it is slower than my own code below.
One problem with my code is that it is not vectorized: it uses a for loop, which slows things down.

Could the function below be made any faster?

def cross_tab(label,presence):
    A_token=0
    B_token=0
    C_token=0
    D_token=0
    for i,j in zip(list(label),list(presence)):
        if i==True and j==True:
            A_token+=1
        elif i==False and j==False:
            D_token+=1
        elif i==True and j==False:
            C_token+=1
        elif i==False and j==True:
            B_token+=1
    return A_token,B_token,C_token,D_token

Some sample data and example input and output.

##input
label=[True,True,False,False,False,False,True,False,False,True,True,True,True,False]
presence=[True,False,False,True,False,False,True,True,False,True,False,True,False,False]
##processing
A,B,C,D=cross_tab(label,presence)
print('A:',A,'B:',B,'C:',C,'D:',D)
##Output
A: 4 B: 2 C: 3 D: 5

Edit: The answer provided by Maarten Fabre below works perfectly. For anyone who stumbles on this in the future, the logic is as follows.

Goal: find a way to vectorize the computation. The steps are:

1. Work out a unique value for each combination of inputs, so that the logical outcome of every comparison can be stored in a single array.
2. Multiply one array by 2 and add the other array to it. The result is a single array with a unique code for each combination.
3. Count the occurrences of each unique code in that array and read off the values.
4. Since the calculation can be done on arrays without a loop, convert the lists to NumPy arrays to allow a vectorized implementation.
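The steps above can be sketched as a single function. This is only an illustration under my own assumptions: the name cross_tab_bincount is made up for this example, and np.bincount is swapped in for the counting step instead of the collections.Counter used in the answer.

```python
import numpy as np


def cross_tab_bincount(label, presence):
    """Return the counts (A, B, C, D) for two equal-length boolean sequences."""
    # Step 4: convert the lists to NumPy arrays so no Python loop is needed.
    labels = np.asarray(label, dtype=bool)
    presences = np.asarray(presence, dtype=bool)
    # Steps 1 and 2: encode each pair as 2 * label + presence.
    # 3 = (True, True), 2 = (True, False), 1 = (False, True), 0 = (False, False)
    codes = labels.astype(np.intp) * 2 + presences
    # Step 3: count how often each code occurs; minlength=4 keeps
    # a slot for codes that never appear in the data.
    counts = np.bincount(codes, minlength=4)
    return int(counts[3]), int(counts[1]), int(counts[2]), int(counts[0])


label = [True, True, False, False, False, False, True,
         False, False, True, True, True, True, False]
presence = [True, False, False, True, False, False, True,
            True, False, True, False, True, False, False]
print(cross_tab_bincount(label, presence))  # (4, 2, 3, 5)
```

Passing minlength=4 means a combination that never occurs is reported as 0 instead of raising an IndexError.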

Question 2

Please do not update the code in your question to incorporate feedback from answers; doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers.

Question 3

I haven't copied any code from the answer. You can check the question's edit history and the answer to confirm that. I did not say anywhere in the question that I am using NumPy arrays, but the answerer suggested using NumPy for performance. Let me know if I am missing anything.

Question 4

It wasn't an attack, it was a pre-emptive comment. Please don't take it personally.

Question 5

Algorithm

If you look at your code and follow the if-elif part, you see that there are 4 combinations of i and j:

i j : result
True, True : A
False, True : B
True, False : C
False, False : D

If you use the tuple (i, j) as the key, you can use a dict lookup:

{
 (True, True): "A",
 (False, True): "B",
 (True, False): "C",
 (False, False): "D",
}

Or simpler:

{
 (True, True): 3,
 (False, True): 1,
 (True, False): 2,
 (False, False): 0,
}
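As a sketch of how such a lookup could replace the if-elif chain (the names CELL_NAMES and cross_tab_lookup are invented for this illustration and are not part of the original code):

```python
import collections

# Map each (label, presence) combination to the name of its cell.
CELL_NAMES = {
    (True, True): "A",
    (False, True): "B",
    (True, False): "C",
    (False, False): "D",
}


def cross_tab_lookup(labels, presences):
    """Count the four cells with a dict lookup instead of if-elif."""
    counts = collections.Counter(
        # bool() normalises truthy values so the tuple is always a valid key.
        CELL_NAMES[(bool(i), bool(j))]
        for i, j in zip(labels, presences)
    )
    return counts["A"], counts["B"], counts["C"], counts["D"]


label = [True, True, False, False, False, False, True,
         False, False, True, True, True, True, False]
presence = [True, False, False, True, False, False, True,
            True, False, True, False, True, False, False]
print(cross_tab_lookup(label, presence))  # (4, 2, 3, 5)
```

This still loops in Python, so it is not expected to be faster; it only shows that the four branches collapse into one table.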

The choice of numbers is deliberate: if you treat True as 1 and False as 0, each value equals i * 2 + j, so you can do

import collections

def crosstab2(label, presence):
    # Encode each pair as 2 * i + j: 3 = A, 1 = B, 2 = C, 0 = D
    for i, j in zip(label, presence):
        yield i * 2 + j

c = collections.Counter(crosstab2(label, presence))
print('A:', c[3], 'B:', c[1], 'C:', c[2], 'D:', c[0])

This is not faster than your original solution, but it is something you can vectorize:

import collections
import numpy as np

label = np.array([True, True, False, False, False, False, True, False, False, True, True, True, True, False])
presence = np.array([True, False, False, True, False, False, True, True, False, True, False, True, False, False])
# label * 2 + presence encodes every pair at once, without a Python loop
c = collections.Counter(label * 2 + presence)
print('A:', c[3], 'B:', c[1], 'C:', c[2], 'D:', c[0])

This is significantly faster, even if you account for the overhead of creating the NumPy arrays.
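One way to check a speed claim like this on your own data is a small timeit comparison. The sketch below is only an assumption about how one might measure it: the function names, the random test data and the list size of 100,000 are all made up for illustration, and the numbers will vary by machine.

```python
import collections
import timeit

import numpy as np


def count_loop(labels, presences):
    """Pure-Python counting, one (label, presence) pair at a time."""
    return collections.Counter(i * 2 + j for i, j in zip(labels, presences))


def count_vectorized(labels, presences):
    """Encode all pairs at once with NumPy, then count the codes."""
    return collections.Counter(np.asarray(labels) * 2 + np.asarray(presences))


# Hypothetical test data: 100,000 random booleans per list.
rng = np.random.default_rng(seed=0)
labels = (rng.random(100_000) < 0.5).tolist()
presences = (rng.random(100_000) < 0.5).tolist()

loop_time = timeit.timeit(lambda: count_loop(labels, presences), number=5)
# The vectorized timing includes the list-to-array conversion.
vector_time = timeit.timeit(lambda: count_vectorized(labels, presences), number=5)
print(f"loop: {loop_time:.3f} s, vectorized: {vector_time:.3f} s")
```

Because the conversion with np.asarray happens inside the timed function, the comparison charges the array creation to the vectorized version.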

Formatting

Try to follow PEP 8:

- spaces around operators (=, +, ...)
- a space after each comma

naming

I try to give collections of elements a plural name. In this case, I would use labels, so if you ever need to iterate over them, you can write for label in labels, which is a lot clearer than for i in label.

`list`

The extra call to list in zip(list(label),list(presence)) is not necessary. zip takes any iterable, and doesn't modify it in place.
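Putting the formatting, naming and list remarks together, the original loop might look like the sketch below. The variable names are my own suggestions, and the truthiness tests assume the inputs really are booleans; they are not an exact replacement for the == True comparisons if other values can occur.

```python
def cross_tab(labels, presences):
    """Return the counts (A, B, C, D) for two equal-length boolean sequences."""
    a_count = b_count = c_count = d_count = 0
    # zip accepts any iterable, so no list() calls are needed.
    for label, presence in zip(labels, presences):
        if label and presence:
            a_count += 1
        elif presence:  # label is False, presence is True
            b_count += 1
        elif label:  # label is True, presence is False
            c_count += 1
        else:  # both are False
            d_count += 1
    return a_count, b_count, c_count, d_count


labels = [True, True, False, False, False, False, True,
          False, False, True, True, True, True, False]
presences = [True, False, False, True, False, False, True,
             True, False, True, False, True, False, False]
a, b, c, d = cross_tab(labels, presences)
print('A:', a, 'B:', b, 'C:', c, 'D:', d)  # A: 4 B: 2 C: 3 D: 5
```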

Question 6

Thank you for your help! It is indeed a faster solution. One thing, though: I am trying to understand your solution from a data structures and algorithms perspective. Can you explain how you broke the task down and devised the solution? I would really appreciate any guidance that can help me develop that way of thinking.

Question 7

Is this explanation clearer?

Question 8

I summarize the logic below. Let me know if it's correct, and I will add it to the question as an edit. Goal: find a way to vectorize the computation. Step 1: Work out a unique value for each combination of inputs, so that the logical outcome can be stored in a single array. Step 2: Multiply one array by 2 and add the other array to it, which gives a single array with a unique code for each combination. Step 3: Count the occurrences of each unique code in the array and read off the values. Step 4: Since the calculation can be done on arrays without a loop, convert the lists to NumPy arrays to allow a vectorized implementation.

score 4 · Accepted Answer · 2020-08-04 08:54:13Z

