Note: If your data really is of shape (large_number, 4), this technique will just exacerbate the problem. This answer assumed that numpy.where(folded == genes[:, newaxis]) was relatively cheap, which is why I asked for example data.
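To make that concrete, here is a minimal sketch (with made-up sizes, nothing from the question) of how large the intermediate boolean comparison gets for long, narrow data; the array is never actually built here, only its size is computed:

import numpy

# Made-up sizes, purely to illustrate how the intermediate array scales.
n_rows, n_unique = 1_000_000, 50_000
genes = numpy.random.randint(0, n_unique, size=n_rows)
folded = numpy.unique(genes)

# folded == genes[:, numpy.newaxis] would broadcast to shape (n_rows, len(folded)):
# one boolean byte per (row, unique gene) pair, before numpy.where even runs.
print(n_rows * len(folded) / 1e9, "GB of booleans")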
It's likely that this is faster for square-ish arrays since it avoids loops in Python:
import numpy
from numpy import newaxis

def avg_dups(genes, values):
    folded = numpy.unique(genes)
    _, indices = numpy.where(folded == genes[:, newaxis])

    output = numpy.zeros((folded.shape[0], values.shape[1]))
    numpy.add.at(output, indices, values)

    output /= (folded == genes[:, newaxis]).sum(axis=0)[:, numpy.newaxis]
    return folded, output
This finds the unique genes to fold the values into:
folded = numpy.unique(genes)
and then finds the current index → new index
mapping:
_, indices = numpy.where(folded == genes[:, newaxis])
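As a toy illustration (made-up data, not from the question), this is what that mapping looks like on a tiny genes array:

import numpy
from numpy import newaxis

genes = numpy.array(['b', 'a', 'b', 'c'])
folded = numpy.unique(genes)            # ['a' 'b' 'c']

mask = folded == genes[:, newaxis]      # shape (4, 3), one True per row
_, indices = numpy.where(mask)
print(indices)                          # [1 0 1 2]: row i folds into folded[indices[i]]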
It adds the row from each current index to the new index in the new output:
output = numpy.zeros((folded.shape[0], values.shape[1]))
numpy.add.at(output, indices, values)
numpy.add.at(output, indices, values) is used over output[indices] += values because the buffering used in += breaks the code for repeated indices.
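A quick toy demonstration of that buffering behaviour:

import numpy

indices = numpy.array([0, 0, 1])
values = numpy.array([1.0, 1.0, 1.0])

a = numpy.zeros(2)
a[indices] += values                 # duplicate index 0 is only counted once
print(a)                             # [1. 1.]

b = numpy.zeros(2)
numpy.add.at(b, indices, values)     # unbuffered: every occurrence accumulates
print(b)                             # [2. 1.]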
The mean is taken with a simple division of the number of repeated values that map to the same index:
output /= (folded == genes[:, newaxis]).sum(axis=0)[:, numpy.newaxis]
Please do time and report back.
This seems to be the fastest so far:
import numpy
from numpy import newaxis

def avg_dups(genes, values):
    folded, indices, counts = numpy.unique(genes, return_inverse=True,
                                            return_counts=True)

    output = numpy.zeros((folded.shape[0], values.shape[1]))
    numpy.add.at(output, indices, values)
    output /= counts[:, newaxis]

    return folded, output
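Continuing from the block above, a small usage sketch with made-up inputs shows what it returns:

genes = numpy.array(['g1', 'g2', 'g1', 'g3'])
values = numpy.array([[1.0, 2.0],
                      [3.0, 4.0],
                      [5.0, 6.0],
                      [7.0, 8.0]])

folded, averaged = avg_dups(genes, values)
print(folded)      # ['g1' 'g2' 'g3']
print(averaged)    # [[3. 4.]   <- mean of the two 'g1' rows
                   #  [3. 4.]
                   #  [7. 8.]]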
This finds the unique genes to fold the values into, along with the current index → new index
mapping and the number of repeated values that map to the same index:
folded, indices, counts = numpy.unique(genes, return_inverse=True, return_counts=True)
It adds the row from each current index to the new index in the new output:
output = numpy.zeros((folded.shape[0], values.shape[1]))
numpy.add.at(output, indices, values)
numpy.add.at(output, indices, values) is used over output[indices] += values because the buffering used in += breaks the code for repeated indices.
The mean is taken with a simple division of the number of repeated values that map to the same index:
output /= counts[:, newaxis]
Using Ashwini Chaudhary's generate_test_data(2000) (giving a 10000x4 array), my rough timings are:

name              time/ms   author
avg_dups          230       gwg
avg_dups_fast     33        Ashwini Chaudhary
avg_dups_python   45        Ashwini Chaudhary
avg_dups          430       Veedrac
avg_dups          5         Veedrac with Jaime's improvement
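If you want rough numbers of your own, a harness along these lines should work. Note that generate_test_data below is a stand-in I made up (random gene labels with duplicates plus random values, roughly the 10000x4 shape mentioned above), not Ashwini Chaudhary's actual helper:

import timeit

import numpy

def generate_test_data(n):
    # Stand-in generator: n*5 rows, 4 value columns, gene labels drawn
    # with repetition so duplicates exist.
    genes = numpy.random.randint(0, n, size=n * 5).astype(str)
    values = numpy.random.rand(n * 5, 4)
    return genes, values

genes, values = generate_test_data(2000)
per_call = timeit.timeit(lambda: avg_dups(genes, values), number=100) / 100
print(per_call * 1000, "ms")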