Note: If your data really is of shape (large_number, 4), this technique will just exacerbate the problem, since the broadcasted comparison in numpy.where(folded == genes[:, numpy.newaxis]) builds an (n_rows, n_unique) boolean array. This answer assumed that comparison was relatively cheap, which is why I asked for example data.
It's likely that this is faster for square-ish arrays since it avoids loops in Python:
import numpy

def avg_dups(genes, values):
    # One output row per unique gene
    folded = numpy.unique(genes)
    # Map each input row to the row of its gene in folded
    _, indices = numpy.where(folded == genes[:, numpy.newaxis])
    # Sum the value rows per unique gene, then divide by the counts
    output = numpy.zeros((folded.shape[0], values.shape[1]))
    numpy.add.at(output, indices, values)
    output /= (folded == genes[:, numpy.newaxis]).sum(axis=0)[:, numpy.newaxis]
    return folded, output
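For example, with a small made-up input (the gene names and values below are purely illustrative):
genes = numpy.array(['a', 'b', 'a', 'c'])
values = numpy.array([[1.0, 2.0],
                      [3.0, 4.0],
                      [5.0, 6.0],
                      [7.0, 8.0]])

folded, output = avg_dups(genes, values)
# folded -> ['a' 'b' 'c']
# output -> [[3. 4.]    mean of rows 0 and 2
#            [3. 4.]
#            [7. 8.]]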
This finds the unique genes to fold the values into:
folded = numpy.unique(genes)
and then finds the current index → new index mapping:
_, indices = numpy.where(folded == genes[:, numpy.newaxis])
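Continuing the made-up example (mask is just a name introduced here for the intermediate boolean array), the comparison broadcasts to a (4, 3) array and numpy.where returns the coordinates of its True entries; the column coordinates are the mapping we keep:
mask = folded == genes[:, numpy.newaxis]
# mask:
# [[ True False False]    row 0: 'a'
#  [False  True False]    row 1: 'b'
#  [ True False False]    row 2: 'a'
#  [False False  True]]   row 3: 'c'
_, indices = numpy.where(mask)
# indices -> array([0, 1, 0, 2]); rows 0 and 2 both map to output row 0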
It adds the row from each current index to the new index in the new output:
output = numpy.zeros((folded.shape[0], values.shape[1]))
numpy.add.at(output, indices, values)
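With the made-up input this accumulates the per-gene sums (sums is a name used here to avoid clobbering the output returned above):
sums = numpy.zeros((folded.shape[0], values.shape[1]))
numpy.add.at(sums, indices, values)
# sums -> [[6. 8.]    rows 0 and 2 of values added together
#          [3. 4.]
#          [7. 8.]]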
numpy.add.at(output, indices, values) is used instead of output[indices] += values because the buffering used in += breaks the code for repeated indices.
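A minimal demonstration of the difference, using throwaway arrays unrelated to the gene data:
a = numpy.zeros(3)
a[[0, 0, 1]] += 1
# a -> [1. 1. 0.]; the repeated index 0 only gets counted once

b = numpy.zeros(3)
numpy.add.at(b, [0, 0, 1], 1)
# b -> [2. 1. 0.]; add.at applies the addition for every occurrence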
The mean is then taken with a simple division by the number of rows that map to the same index:
output /= (folded == genes[:, newaxis]).sum(axis=0)[:, numpy.newaxis]
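In the made-up example the counts come out as 2, 1 and 1, and dividing the accumulated sums by them gives the same result as the full function:
counts = (folded == genes[:, numpy.newaxis]).sum(axis=0)[:, numpy.newaxis]
# counts -> [[2]
#            [1]
#            [1]]
print(sums / counts)
# [[3. 4.]
#  [3. 4.]
#  [7. 8.]]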
Please do time it and report back.