I have gene expression data that I represent as a list of genes and a list of lists of values. I average the expression data for any genes with the same name.
For example:
genes = ['A', 'C', 'C', 'B', 'A']
vals = [[2.0, 2.0, 9.0, 9.0],   # A: will be averaged with row=4
        [3.0, 3.0, 3.0, 3.0],   # C: will be averaged with row=2
        [8.0, 8.0, 2.0, 2.0],   # C: will be averaged with row=1
        [4.0, 4.0, 4.0, 3.0],   # B: is fine
        [1.0, 1.0, 1.0, 1.0]]   # A: will be averaged with row=0
is converted to
genes = ['A', 'B', 'C']
vals = [[1.5, 1.5, 5.0, 5.0],
        [4.0, 4.0, 4.0, 3.0],
        [5.5, 5.5, 2.5, 2.5]]
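As a quick sanity check (with NumPy imported as np, as in the function below), the first output row is just the element-wise mean of input rows 0 and 4:

np.mean([vals[0], vals[4]], axis=0)   # -> array([1.5, 1.5, 5. , 5. ])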
Here is my function:
import numpy as np

def avg_dups(genes, values):
    """Finds duplicate genes and averages their expression data.
    """
    unq_genes = np.unique(genes)
    out_values = np.zeros((unq_genes.shape[0], values.shape[1]))
    for i, gene in enumerate(unq_genes):
        dups = values[genes == gene]
        out_values[i] = np.mean(dups, axis=0)
    return (unq_genes, out_values)
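(genes and values are NumPy arrays by the time they reach this function; the genes == gene mask relies on that.) A minimal call with the example data above would look roughly like this:

unq_genes, out_values = avg_dups(np.asarray(genes), np.asarray(vals))
# unq_genes  -> array(['A', 'B', 'C'], ...)
# out_values -> the three averaged rows shown above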
This function is slower than any other part of my data pipeline, taking 5-10 seconds while other steps that operate on the same full dataset finish in under a second. Any thoughts on how I can improve this?
2 Answers
This seems to be the fastest so far:
import numpy
from numpy import newaxis

def avg_dups(genes, values):
    folded, indices, counts = numpy.unique(genes, return_inverse=True, return_counts=True)

    output = numpy.zeros((folded.shape[0], values.shape[1]))
    numpy.add.at(output, indices, values)
    output /= counts[:, newaxis]

    return folded, output
This finds the unique genes to fold the values into, along with the current-index → new-index mapping and the number of repeated values that map to each new index:

folded, indices, counts = numpy.unique(genes, return_inverse=True, return_counts=True)
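For example, with the toy genes list from the question (a small illustration of what the three return values look like, reusing the import above):

genes = numpy.array(['A', 'C', 'C', 'B', 'A'])
folded, indices, counts = numpy.unique(genes, return_inverse=True, return_counts=True)
# folded  -> ['A', 'B', 'C']
# indices -> [0, 2, 2, 1, 0]   (each original row's position in folded)
# counts  -> [2, 1, 2]         (how many rows fold into each unique gene)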
It adds the row from each current index to the new index in the new output:

output = numpy.zeros((folded.shape[0], values.shape[1]))
numpy.add.at(output, indices, values)
numpy.add.at(output, indices, values) is used over output[indices] += values because the buffering used in += breaks the code for repeated indices.
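A tiny demonstration of that pitfall (an illustration only, reusing the numpy import above): with in-place fancy indexing each repeated index receives only a single update, while numpy.add.at accumulates all of them:

buffered = numpy.zeros(3)
buffered[[0, 0, 1]] += 1                 # index 0 appears twice but is only incremented once
# buffered -> [1., 1., 0.]

unbuffered = numpy.zeros(3)
numpy.add.at(unbuffered, [0, 0, 1], 1)   # both updates to index 0 are applied
# unbuffered -> [2., 1., 0.]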
The mean is then taken by dividing each row by the number of repeated values that were folded into it:

output /= counts[:, newaxis]
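counts[:, newaxis] turns the counts into a column vector, so the division broadcasts across each row. With the toy data the accumulated sums and counts combine like this (again just an illustration):

counts = numpy.array([2, 1, 2])
sums = numpy.array([[ 3.0,  3.0, 10.0, 10.0],   # the two 'A' rows summed
                    [ 4.0,  4.0,  4.0,  3.0],   # the single 'B' row
                    [11.0, 11.0,  5.0,  5.0]])  # the two 'C' rows summed
sums / counts[:, newaxis]
# -> [[1.5, 1.5, 5.0, 5.0],
#     [4.0, 4.0, 4.0, 3.0],
#     [5.5, 5.5, 2.5, 2.5]]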
Using Ashwini Chaudhary's generate_test_data(2000) (giving a 10000x4 array), my rough timings are:

name               time/ms   Author
avg_dups               230   gwg
avg_dups_fast           33   Ashwini Chaudhary
avg_dups_python         45   Ashwini Chaudhary
avg_dups               430   Veedrac
avg_dups                 5   Veedrac with Jaime's improvement
- You can do folded, indices, counts = np.unique(genes, return_inverse=True, return_counts=True) and spare yourself all those broadcasting comparisons, which typically cripple performance. On the other hand, nice use of add.at! – Jaime, Feb 20, 2015 at 8:50
- @Jaime Blimey, that's fast indeed. +1 – Veedrac, Feb 20, 2015 at 16:58
- @Veedrac, this is very fast. Thanks! Averaging duplicates in my test SOFT file (originally picked at random) using my original function took 5.22903513908 seconds. Your function took 0.000233888626099 seconds. – jds, Feb 24, 2015 at 15:54
Your current approach is slow because it takes \$\mathcal{O}(n^2)\$ time: for each unique gene you scan the entire genes array to find its rows.

One not-pure-NumPy approach that I can think of will take \$\mathcal{O}(n \log n)\$ time for this (explanation in comments).
from collections import defaultdict
from itertools import count
import numpy as np

def avg_dups_fast(genes, values):
    # Find the sorted indices of all genes so that we can group them together
    sorted_indices = np.argsort(genes)
    # Now create two arrays using `sorted_indices` where similar genes and
    # the corresponding values are grouped together
    sorted_genes = genes[sorted_indices]
    sorted_values = values[sorted_indices]
    # To find each individual group we need the indices where the gene value
    # changes. We can use `numpy.where` with `numpy.diff` for this. But as
    # `numpy.diff` won't work with strings, we first generate unique integers
    # for the genes using `collections.defaultdict` with `itertools.count`.
    # This dict hands out a new integer as soon as a new string is
    # encountered and remembers it, so the same integer is reused for
    # repeated strings.
    d = defaultdict(count(0).next)
    unique_ints = np.fromiter((d[x] for x in sorted_genes), dtype=int)
    # Now get the indices where the gene changes
    split_at = np.where(np.diff(unique_ints) != 0)[0] + 1
    # and split `sorted_values` at those indices.
    split_items = np.array_split(sorted_values, split_at)
    return np.unique(sorted_genes), np.array([np.mean(arr, axis=0) for arr in split_items])
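To see the grouping on the question's toy data, here is a quick trace of the intermediate values (an illustration only, not part of the function above):

genes = np.array(['A', 'C', 'C', 'B', 'A'])
sorted_indices = np.argsort(genes)    # [0, 4, 3, 1, 2] (tie order may vary)
sorted_genes = genes[sorted_indices]  # ['A', 'A', 'B', 'C', 'C']
# unique_ints becomes [0, 0, 1, 2, 2], so np.diff(unique_ints) != 0 at
# positions 1 and 2 and split_at == [2, 3]: sorted_values is split into
# the two 'A' rows, the single 'B' row and the two 'C' rows before averaging.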
There is also a pure-Python approach that takes only \$\mathcal{O}(n)\$ time. Here I've simply used a dictionary with the gene as key, and the corresponding value rows are appended to a list:
from collections import defaultdict
from itertools import izip

def avg_dups_python(genes, values):
    d = defaultdict(list)
    for k, v in izip(genes, values):
        d[k].append(v)
    return list(d), [np.mean(val, axis=0) for val in d.itervalues()]
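A minimal usage sketch with the question's toy data (Python 2, matching the izip/itervalues style above). The genes come back in dictionary order rather than sorted, but each gene is still paired with its own averaged row:

genes = np.array(['A', 'C', 'C', 'B', 'A'])
vals = np.array([[2.0, 2.0, 9.0, 9.0],
                 [3.0, 3.0, 3.0, 3.0],
                 [8.0, 8.0, 2.0, 2.0],
                 [4.0, 4.0, 4.0, 3.0],
                 [1.0, 1.0, 1.0, 1.0]])
out_genes, out_means = avg_dups_python(genes, vals)
averaged = dict(zip(out_genes, out_means))
# averaged['A'] -> [1.5, 1.5, 5.0, 5.0]
# averaged['B'] -> [4.0, 4.0, 4.0, 3.0]
# averaged['C'] -> [5.5, 5.5, 2.5, 2.5]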
Timing comparisons:
>>> from string import ascii_letters
>>> from itertools import islice, product
>>> def generate_test_data(n):
...     genes = np.array([''.join(x) for x in islice(product(ascii_letters, repeat=3), n)]*5, dtype='S3')
...     np.random.shuffle(genes)
...     vals = np.array([[2.0, 2.0, 9.0, 9.0],
...                      [3.0, 3.0, 3.0, 3.0],
...                      [8.0, 8.0, 2.0, 2.0],
...                      [4.0, 4.0, 4.0, 3.0],
...                      [1.0, 1.0, 1.0, 1.0]]*n)
...     return genes, vals
...
>>> data = generate_test_data(20000)
>>> %timeit avg_dups(*data)
1 loops, best of 3: 18.4 s per loop
>>> %timeit avg_dups_fast(*data)
10 loops, best of 3: 166 ms per loop
>>> %timeit avg_dups_python(*data)
1 loops, best of 3: 253 ms per loop
>>> (avg_dups(*data)[1] == avg_dups_fast(*data)[1]).all()
True
- "we need to generate some unique integers for genes" → If they need to be unique, you should avoid hash - it does not guarantee uniqueness. – Veedrac, Feb 20, 2015 at 1:23
- @Veedrac Interesting! So two different strings can hash to the same value in a given process (considering strings have their own __hash__ method)? (Python 2.7) A simple alternative will be to use d = defaultdict(count(0).next), which returns a unique value each time a new string is encountered. – Ashwini Chaudhary, Feb 20, 2015 at 2:42
- Yes, the whole point of hashing a string is that the hash is short. Loss of information is inevitable, hence collisions. – Janne Karila, Feb 20, 2015 at 6:39
- @JanneKarila You're right, not sure what I was thinking. :/ – Ashwini Chaudhary, Feb 20, 2015 at 11:38
- @JanneKarila: Hash collisions don't take place due to "loss of information", but due to the fact that we do not have a perfect hash function --i.e. a hash function that guarantees that no two inputs will ever be mapped/hashed to the same output. – code_dredd, Oct 8, 2015 at 1:43
- …ILMN…, needs to be converted to gene symbols, of which there are only 20,000 [1], so I suspect there will be quite a few duplicates. [1] There are many more than 20,000 gene symbols, but I convert them to a subset for humans.