I have a large dataset (78k instances x 490k features) that is loaded as a scipy.sparse.csr_matrix. From this dataset I want to filter out certain features (i.e. columns) for which all values fall below a certain threshold.
Loading the dataset as a dense matrix is not an option, nor did I find a sparse matrix operation that does the job (please correct me if I am wrong on the latter). So I took a column-iteration approach for each feature group using multiprocessing:
- Divide the total column indices into n = n_cores roughly equal groups.
- For every index group, spawn a process that iterates over each column and uses the built-in all() to check the comparison condition. Collect all indices that should be deleted in a list (order does not matter).
- Drop the columns in the full dataset matrix X based on the indices list.
On my machine this takes 42 minutes on my dataset. I feel that especially the all() conditional check in get_filtered_cols should be optimized. Any other recommendations are certainly welcome.
Code with smaller simulated dataset:
import numpy as np
from scipy.sparse import csr_matrix
import multiprocessing

# Simulated random sparse csr matrix as dataset X. Actual use case is 78k x 490k.
N = 780; M = 4900
# Rough simulation of the type of data in the use case (the upper bound of some
# features is of course much higher).
X = np.random.choice([0, 1, 2, 3, 4], size=(N, M), p=[0.99, 0.005, 0.0025, 0.0015, 0.001])
X = csr_matrix(X, dtype=np.float32)  # the real-use svmlight dataset can only be loaded as sparse.csr_matrix

# Settings of the feature groups to be filtered: the range of each feature group
# in the dataset and its threshold value.
ngram_fg_dict = {"featuregroup_01": {"threshold": 3, "start_idx": 0, "end_idx": 2450},
                 "featuregroup_02": {"threshold": 4, "start_idx": 2451, "end_idx": 4900}}
n_cores = 3

def list_split(lst, n):
    '''Split a list into n roughly equal groups.'''
    k, m = len(lst) // n, len(lst) % n
    return [lst[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]

def get_filtered_cols(indices):
    '''Take a list of column indices of the dataset and collect those whose values all fall below the threshold.'''
    # fg_X and v are module-level globals inherited by the worker processes.
    col_idc_delete = []
    for i in indices:
        col = fg_X[:, i].toarray().flatten()
        if all(val < v["threshold"] for val in col):
            col_idc_delete.append(i + v["start_idx"])  # indices for the original (unsliced) dataset
    return col_idc_delete

def drop_cols(M, idx_to_drop):
    '''Remove columns from matrix M given a list of column indices to remove.'''
    idx_to_drop = np.unique(idx_to_drop)
    C = M.tocoo()
    keep = ~np.isin(C.col, idx_to_drop)
    C.data, C.row, C.col = C.data[keep], C.row[keep], C.col[keep]
    C.col -= idx_to_drop.searchsorted(C.col)  # decrement column indices
    C._shape = (C.shape[0], C.shape[1] - len(idx_to_drop))
    return C.tocsr()

all_idc_delete = []
for k, v in ngram_fg_dict.items():
    if v["threshold"] > 1:  # threshold 1 means 'set no threshold' given our dataset
        fg_X = X[:, v["start_idx"]:v["end_idx"]]  # slice feature group to be filtered
        l = fg_X.shape[1]  # total number of columns
        # Split the feature column indices into groups for multiprocessing; the
        # conditional check removes potential empty lists produced by list_split.
        mp_groups = [lgroup for lgroup in list_split(list(range(l)), n_cores) if lgroup != []]
        p = multiprocessing.Pool(len(mp_groups))
        print("Filtering %s < %d with %d processes" % (k, v["threshold"], len(mp_groups)))
        # imap is faster than map here; the order of the returned column indices does not matter.
        fg_idc_delete = p.imap(get_filtered_cols, mp_groups)
        all_idc_delete.extend([item for sublist in fg_idc_delete for item in sublist])  # flatten
        p.close()
        p.join()

print("Deleting %s columns." % (len(all_idc_delete)))
X_new = drop_cols(X, all_idc_delete)
Benchmarking this 30x on my local machine: average 2.67 sec, best 2.41 sec.
1 Answer
Assuming that the threshold is positive, you can use the >= operator to construct a sparse Boolean array indicating which entries are above or equal to the threshold:
# m is your dataset in sparse matrix representation
above_threshold = m >= v["threshold"]
and then you can use the max method to get the maximum entry in each column:
cols = above_threshold.max(axis=0)
This will be 1 for columns that have any value greater than or equal to the threshold, and 0 for columns where all values are below the threshold. So cols is a mask for the columns you want to keep. (If you need a Boolean array, then use cols == 1.)
(Updated after discussion in comments. I had some more complicated suggestions, but simpler is better.)
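Putting the two steps together on a small toy matrix, a minimal sketch of the technique might look like the following (the cast to int8 before max is an extra precaution of mine, not part of the answer's suggestion):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy dataset: 3 rows x 4 columns; columns 1 and 3 never reach the threshold.
X = csr_matrix(np.array([[5, 1, 0, 0],
                         [0, 2, 4, 1],
                         [0, 0, 7, 2]], dtype=np.float32))
threshold = 4

# Sparse Boolean matrix: True where an entry is >= threshold. Because the
# threshold is positive, zeros stay implicit and the result remains sparse.
above = (X >= threshold).astype(np.int8)

# Column-wise max: 1 for columns with at least one entry >= threshold.
# max(axis=0) returns a sparse (1 x n) matrix, hence the toarray().
keep_mask = above.max(axis=0).toarray().ravel().astype(bool)

# Keep only those columns via integer indexing.
X_new = X[:, np.flatnonzero(keep_mask)]

print(X_new.toarray())  # columns 1 and 3 are dropped
```

This replaces the per-column Python loop with a handful of vectorized sparse operations, so no multiprocessing is needed at all.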
- Thanks. The speed advantage of converting to csc is marginal though, as we only have to slice along the column for each feature group (max 4). I benchmarked with csc: average time improvement is 3.2%, best time improvement is 8.5%. – GJacobs, Aug 16, 2016
- @GJacobs: did you re-implement the column filtering operation to take advantage of the CSC format? – Gareth Rees, Aug 16, 2016
- No, because, to my knowledge, there is no way to perform a boolean mask or an .all()-like function directly on a sparse matrix. So I end up iterating over columns in exactly the same way as currently implemented. A way to perform the conditional check of .all() efficiently, without iterating over every column (either over the full matrix or in batches), would solve this problem. – GJacobs, Aug 16, 2016
- @GJacobs: see the updated post. – Gareth Rees, Aug 16, 2016
- Don't. If you want, you can put it as an answer, but do not put it in the original, as it will make it very hard to understand what happened. – Oscar Smith, Aug 16, 2016