Note: If your data really is of shape (large_number, 4), this technique will just exacerbate the problem, since the broadcasted comparison in numpy.where(folded == genes[:, numpy.newaxis]) builds an (n_rows, n_unique) boolean array. This answer assumed that comparison was relatively cheap, which is why I asked for example data.
It's likely that this is faster for square-ish arrays since it avoids loops in Python:
import numpy

def avg_dups(genes, values):
    # One output row per unique gene
    folded = numpy.unique(genes)
    # Map each input row to the row of its gene in folded
    _, indices = numpy.where(folded == genes[:, numpy.newaxis])
    # Sum the value rows per unique gene, then divide by the counts
    output = numpy.zeros((folded.shape[0], values.shape[1]))
    numpy.add.at(output, indices, values)
    output /= (folded == genes[:, numpy.newaxis]).sum(axis=0)[:, numpy.newaxis]
    return folded, output
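For example, with a small made-up input (the gene names and values below are purely illustrative):
genes = numpy.array(['a', 'b', 'a', 'c'])
values = numpy.array([[1.0, 2.0],
                      [3.0, 4.0],
                      [5.0, 6.0],
                      [7.0, 8.0]])

folded, output = avg_dups(genes, values)
# folded -> ['a' 'b' 'c']
# output -> [[3. 4.]    mean of rows 0 and 2
#            [3. 4.]
#            [7. 8.]]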
This finds the unique genes to fold the values into:
folded = numpy.unique(genes)
and then finds the current index → new index mapping:
_, indices = numpy.where(folded == genes[:, numpy.newaxis])
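Continuing the made-up example (mask is just a name introduced here for the intermediate boolean array), the comparison broadcasts to a (4, 3) array and numpy.where returns the coordinates of its True entries; the column coordinates are the mapping we keep:
mask = folded == genes[:, numpy.newaxis]
# mask:
# [[ True False False]    row 0: 'a'
#  [False  True False]    row 1: 'b'
#  [ True False False]    row 2: 'a'
#  [False False  True]]   row 3: 'c'
_, indices = numpy.where(mask)
# indices -> array([0, 1, 0, 2]); rows 0 and 2 both map to output row 0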
It adds the row from each current index to the new index in the new output:
output = numpy.zeros((folded.shape[0], values.shape[1]))
numpy.add.at(output, indices, values)
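With the made-up input this accumulates the per-gene sums (sums is a name used here to avoid clobbering the output returned above):
sums = numpy.zeros((folded.shape[0], values.shape[1]))
numpy.add.at(sums, indices, values)
# sums -> [[6. 8.]    rows 0 and 2 of values added together
#          [3. 4.]
#          [7. 8.]]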
numpy.add.at(output, indices, values) is used instead of output[indices] += values because the buffering used in += breaks the code for repeated indices.
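A minimal demonstration of the difference, using throwaway arrays unrelated to the gene data:
a = numpy.zeros(3)
a[[0, 0, 1]] += 1
# a -> [1. 1. 0.]; the repeated index 0 only gets counted once

b = numpy.zeros(3)
numpy.add.at(b, [0, 0, 1], 1)
# b -> [2. 1. 0.]; add.at applies the addition for every occurrence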
The mean is then taken with a simple division by the number of rows that map to the same index:
output /= (folded == genes[:, newaxis]).sum(axis=0)[:, numpy.newaxis]
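In the made-up example the counts come out as 2, 1 and 1, and dividing the accumulated sums by them gives the same result as the full function:
counts = (folded == genes[:, numpy.newaxis]).sum(axis=0)[:, numpy.newaxis]
# counts -> [[2]
#            [1]
#            [1]]
print(sums / counts)
# [[3. 4.]
#  [3. 4.]
#  [7. 8.]]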
Please do time it and report back.