Given:
An array c that contains the group assignment of every user (
c[i]=4
indicates that useri
belongs to group 4)A matrix y of 0 and 1 that indicates whether user
i
interacted with userj
(y[i][j] = 1
ifi
interacted withj
, 0 otherwise)
What is the fastest way to count the number of interactions and lack of interactions between every pair of groups?
This is what I have:
Solution 1:
This is what I have:
import numpy as np
def test1(y, c):
N = len(c) # number of users
C = len(set(c)) # number of clusters
# Final counts matrices
I = np.zeros((C,C))
notI = np.zeros((C,C))
# Loop every row i/ column j, look at the groups where the users belong
# and update I accordingly
for i in range(N):
for j in range(N):
if y[i][j] == 1:
I[c[i]][c[j]] += 1
else:
notI[c[i]][c[j]] += 1
return I, notI
timeit:
nusers = 100
nclusters = 4
y = np.random.random_integers(0, 1, (nusers,nusers))
c = np.random.choice(nclusters, nusers)
%timeit test1(y,c)
10 loops, best of 3: 42 ms per loop
Solution 2:
def test2(y, c):
N = len(c)
C = len(set(c))
I = np.zeros((C,C))
notI = np.zeros((C,C))
for i, jj in enumerate(y):
for j, value in enumerate(jj):
if value:
I[c[i]][c[j]] += 1
else:
notI[c[i]][c[j]] += 1
return I, notI
timeit: 10 loops, best of 3: 30.8 ms per loop
1 Answer 1
Re-formulating the problem into a matrix multiplication one speeds up the algorithm very much.
Algebra:
Create a matrix G
of n_groups
x n_users
such that every column is a group mask, that is, all users (columns) that belong to that group (row) are flagged are 1 and 0 otherwise. If Y
is the matrix of interactions, then the interactions between every pair of groups can be obtained by the following equation:
$$I = GYG^T$$
Code:
The code is:
import numpy as np
from numpy import dot
def test3(y, c):
N = len(c)
C = len(set(c))
noty = np.mod(y+1,2)
# create C group masks of 1 (if user is belongs to group) and 0.
G = np.zeros((C,N))
for group in range(C):
G[group] = [1 if c_== group else 0 for c_ in c]
# Groups (src) * interaction_matrix * Groups_transposed (dst)
# Every row of Groups matrix is the mask of a group
return dot(dot(G,y),G.T), dot(dot(G,noty),G.T)
1000 loops, best of 3: 810 μs per loop
len(z)
, that is, "z" instead of "c". Typo?z
is not defined in the posted code. \$\endgroup\$if y[i][j] == 1:
toif y[i][j]:
? \$\endgroup\$y[i][j]:
seems a bit faster) \$\endgroup\$