Note: If your data really is of shape (large_number, 4), this technique will just exacerbate the problem. This answer assumed that numpy.where(folded == genes[:, newaxis]) was relatively cheap, which is why I asked for example data.
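To make that concrete, here is a minimal sketch (with made-up sizes, nothing from the question) of how large the intermediate boolean comparison gets for long, narrow data; the array is never actually built here, only its size is computed:

import numpy

# Made-up sizes, purely to illustrate how the intermediate array scales.
n_rows, n_unique = 1_000_000, 50_000
genes = numpy.random.randint(0, n_unique, size=n_rows)
folded = numpy.unique(genes)

# folded == genes[:, numpy.newaxis] would broadcast to shape (n_rows, len(folded)):
# one boolean byte per (row, unique gene) pair, before numpy.where even runs.
print(n_rows * len(folded) / 1e9, "GB of booleans")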
It's likely that this is faster for square-ish arrays since it avoids loops in Python:
import numpy
from numpy import newaxis

def avg_dups(genes, values):
    folded = numpy.unique(genes)
    _, indices = numpy.where(folded == genes[:, newaxis])

    output = numpy.zeros((folded.shape[0], values.shape[1]))
    numpy.add.at(output, indices, values)

    output /= (folded == genes[:, newaxis]).sum(axis=0)[:, numpy.newaxis]
    return folded, output
This finds the unique genes to fold the values into:
folded = numpy.unique(genes)
and then finds the current index → new index
mapping:
_, indices = numpy.where(folded == genes[:, newaxis])
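As a toy illustration (made-up data, not from the question), this is what that mapping looks like on a tiny genes array:

import numpy
from numpy import newaxis

genes = numpy.array(['b', 'a', 'b', 'c'])
folded = numpy.unique(genes)            # ['a' 'b' 'c']

mask = folded == genes[:, newaxis]      # shape (4, 3), one True per row
_, indices = numpy.where(mask)
print(indices)                          # [1 0 1 2]: row i folds into folded[indices[i]]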
It adds the row from each current index to the new index in the new output:
output = numpy.zeros((folded.shape[0], values.shape[1]))
numpy.add.at(output, indices, values)
numpy.add.at(output, indices, values) is used over output[indices] += values because the buffering used in += breaks the code for repeated indices.
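A quick toy demonstration of that buffering behaviour:

import numpy

indices = numpy.array([0, 0, 1])
values = numpy.array([1.0, 1.0, 1.0])

a = numpy.zeros(2)
a[indices] += values                 # duplicate index 0 is only counted once
print(a)                             # [1. 1.]

b = numpy.zeros(2)
numpy.add.at(b, indices, values)     # unbuffered: every occurrence accumulates
print(b)                             # [2. 1.]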
The mean is taken with a simple division of the number of repeated values that map to the same index:
output /= (folded == genes[:, newaxis]).sum(axis=0)[:, numpy.newaxis]
Please do time and report back.
This seems to be the fastest so far:
import numpy
from numpy import newaxis

def avg_dups(genes, values):
    folded, indices, counts = numpy.unique(genes, return_inverse=True,
                                            return_counts=True)

    output = numpy.zeros((folded.shape[0], values.shape[1]))
    numpy.add.at(output, indices, values)
    output /= counts[:, newaxis]

    return folded, output
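Continuing from the block above, a small usage sketch with made-up inputs shows what it returns:

genes = numpy.array(['g1', 'g2', 'g1', 'g3'])
values = numpy.array([[1.0, 2.0],
                      [3.0, 4.0],
                      [5.0, 6.0],
                      [7.0, 8.0]])

folded, averaged = avg_dups(genes, values)
print(folded)      # ['g1' 'g2' 'g3']
print(averaged)    # [[3. 4.]   <- mean of the two 'g1' rows
                   #  [3. 4.]
                   #  [7. 8.]]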
This finds the unique genes to fold the values into, along with the current index → new index
mapping and the number of repeated values that map to the same index:
folded, indices, counts = numpy.unique(genes, return_inverse=True, return_counts=True)
It adds the row from each current index to the new index in the new output:
output = numpy.zeros((folded.shape[0], values.shape[1]))
numpy.add.at(output, indices, values)
numpy.add.at(output, indices, values) is used over output[indices] += values because the buffering used in += breaks the code for repeated indices.
The mean is taken with a simple division of the number of repeated values that map to the same index:
output /= counts[:, newaxis]
Using Ashwini Chaudhary's generate_test_data(2000) (giving a 10000x4 array), my rough timings are:

name              time/ms   author
avg_dups          230       gwg
avg_dups_fast     33        Ashwini Chaudhary
avg_dups_python   45        Ashwini Chaudhary
avg_dups          430       Veedrac
avg_dups          5         Veedrac with Jaime's improvement
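If you want rough numbers of your own, a harness along these lines should work. Note that generate_test_data below is a stand-in I made up (random gene labels with duplicates plus random values, roughly the 10000x4 shape mentioned above), not Ashwini Chaudhary's actual helper:

import timeit

import numpy

def generate_test_data(n):
    # Stand-in generator: n*5 rows, 4 value columns, gene labels drawn
    # with repetition so duplicates exist.
    genes = numpy.random.randint(0, n, size=n * 5).astype(str)
    values = numpy.random.rand(n * 5, 4)
    return genes, values

genes, values = generate_test_data(2000)
per_call = timeit.timeit(lambda: avg_dups(genes, values), number=100) / 100
print(per_call * 1000, "ms")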