I have a large dataset (78k instances x 490k features) that is loaded as a scipy.sparse.csr_matrix. From this dataset I want to filter out certain features (i.e. columns) for which all values fall below a certain threshold.
Loading the dataset as a dense matrix is not an option, nor did I find a sparse matrix operation that does the job (please correct me if I am wrong on the latter). So I took a column-iteration approach for each feature group using multiprocessing:
- Divide the total column indices into n = n_cores roughly equal groups.
- For every index group, spawn a process that iterates over each column and uses the built-in all() to check the comparison condition. Collect all indices that should be deleted in a list (order does not matter).
- Drop the columns in the full dataset matrix X based on the indices list.
On my machine this takes 42 minutes on my dataset. I feel that especially the all() conditional check in get_filtered_cols should be optimized. Any other recommendations are certainly welcome.
Code with smaller simulated dataset:
import numpy as np
from scipy.sparse import csr_matrix
import multiprocessing

# Simulated random sparse csr matrix as dataset X. Actual use case is 78k x 490k.
N = 780; M = 4900
# Rough simulation of the type of data in the use case (the upper bound of some
# features is of course much higher).
X = np.random.choice([0, 1, 2, 3, 4], size=(N, M), p=[0.99, 0.005, 0.0025, 0.0015, 0.001])
X = csr_matrix(X, dtype=np.float32)  # the real-use svmlight dataset can only be loaded as sparse.csr_matrix

# Settings of the feature groups to be filtered: the range of each feature group
# in the dataset and its threshold value.
ngram_fg_dict = {"featuregroup_01": {"threshold": 3, "start_idx": 0, "end_idx": 2450},
                 "featuregroup_02": {"threshold": 4, "start_idx": 2451, "end_idx": 4900}}
n_cores = 3

def list_split(lst, n):
    '''Split a list into n roughly equal groups.'''
    k, m = len(lst) // n, len(lst) % n
    return [lst[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]

def get_filtered_cols(indices):
    '''Take a list of column indices of the dataset and collect those whose values all fall below the threshold.'''
    # fg_X and v are module-level globals inherited by the worker processes.
    col_idc_delete = []
    for i in indices:
        col = fg_X[:, i].toarray().flatten()
        if all(val < v["threshold"] for val in col):
            col_idc_delete.append(i + v["start_idx"])  # indices for the original (unsliced) dataset
    return col_idc_delete

def drop_cols(M, idx_to_drop):
    '''Remove columns from matrix M given a list of column indices to remove.'''
    idx_to_drop = np.unique(idx_to_drop)
    C = M.tocoo()
    keep = ~np.isin(C.col, idx_to_drop)
    C.data, C.row, C.col = C.data[keep], C.row[keep], C.col[keep]
    C.col -= idx_to_drop.searchsorted(C.col)  # decrement column indices
    C._shape = (C.shape[0], C.shape[1] - len(idx_to_drop))
    return C.tocsr()

all_idc_delete = []
for k, v in ngram_fg_dict.items():
    if v["threshold"] > 1:  # threshold 1 means 'set no threshold' given our dataset
        fg_X = X[:, v["start_idx"]:v["end_idx"]]  # slice feature group to be filtered
        l = fg_X.shape[1]  # total number of columns
        # Split the feature column indices into groups for multiprocessing; the
        # conditional check removes potential empty lists produced by list_split.
        mp_groups = [lgroup for lgroup in list_split(list(range(l)), n_cores) if lgroup != []]
        p = multiprocessing.Pool(len(mp_groups))
        print("Filtering %s < %d with %d processes" % (k, v["threshold"], len(mp_groups)))
        # imap is faster than map here; the order of the returned column indices does not matter.
        fg_idc_delete = p.imap(get_filtered_cols, mp_groups)
        all_idc_delete.extend([item for sublist in fg_idc_delete for item in sublist])  # flatten
        p.close()
        p.join()

print("Deleting %s columns." % (len(all_idc_delete)))
X_new = drop_cols(X, all_idc_delete)
Benchmarking this 30x on my local machine: average 2.67 sec, best 2.41 sec.
1 Answer
Assuming that the threshold is positive, you can use the >= operator to construct a sparse Boolean array indicating which entries are above or equal to the threshold:
# m is your dataset in sparse matrix representation
above_threshold = m >= v["threshold"]
and then you can use the max method to get the maximum entry in each column:
cols = above_threshold.max(axis=0)
This will be 1 for columns that have any value greater than or equal to the threshold, and 0 for columns where all values are below the threshold. So cols is a mask for the columns you want to keep. (If you need a Boolean array, then use cols == 1.)
(Updated after discussion in comments. I had some more complicated suggestions, but simpler is better.)
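Putting the two steps together on a small toy matrix, a minimal sketch of the technique might look like the following (the cast to int8 before max is an extra precaution of mine, not part of the answer's suggestion):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy dataset: 3 rows x 4 columns; columns 1 and 3 never reach the threshold.
X = csr_matrix(np.array([[5, 1, 0, 0],
                         [0, 2, 4, 1],
                         [0, 0, 7, 2]], dtype=np.float32))
threshold = 4

# Sparse Boolean matrix: True where an entry is >= threshold. Because the
# threshold is positive, zeros stay implicit and the result remains sparse.
above = (X >= threshold).astype(np.int8)

# Column-wise max: 1 for columns with at least one entry >= threshold.
# max(axis=0) returns a sparse (1 x n) matrix, hence the toarray().
keep_mask = above.max(axis=0).toarray().ravel().astype(bool)

# Keep only those columns via integer indexing.
X_new = X[:, np.flatnonzero(keep_mask)]

print(X_new.toarray())  # columns 1 and 3 are dropped
```

This replaces the per-column Python loop with a handful of vectorized sparse operations, so no multiprocessing is needed at all.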
- Thanks. The speed advantage of converting to csc is marginal though, as we only have to slice along the column for each feature group (max 4). I benchmarked with csc: average time improvement is 3.2%, best time improvement is 8.5%. – GJacobs, Aug 16, 2016
- @GJacobs: did you re-implement the column filtering operation to take advantage of the CSC format? – Gareth Rees, Aug 16, 2016
- No, because, to my knowledge, there is no way to perform a boolean mask or an .all()-like function directly on a sparse matrix. So I end up iterating over columns in exactly the same way as currently implemented. A way to perform the conditional check of .all() efficiently, without iterating over every column (either over the full matrix or in batches), would solve this problem. – GJacobs, Aug 16, 2016
- @GJacobs: see the updated post. – Gareth Rees, Aug 16, 2016
- Don't. If you want, you can put it as an answer, but do not put it in the original, as it will make it very hard to understand what happened. – Oscar Smith, Aug 16, 2016