
I have gene expression data that I represent as a list of genes and a list of lists of values. I average the expression data for any genes with the same name.

For example:

genes = ['A', 'C', 'C', 'B', 'A']
vals = [[2.0, 2.0, 9.0, 9.0],  # A: will be averaged with row=4
        [3.0, 3.0, 3.0, 3.0],  # C: will be averaged with row=2
        [8.0, 8.0, 2.0, 2.0],  # C: will be averaged with row=1
        [4.0, 4.0, 4.0, 3.0],  # B: is fine
        [1.0, 1.0, 1.0, 1.0]]  # A: will be averaged with row=0

is converted to

genes = ['A', 'B', 'C']
vals = [[1.5, 1.5, 5.0, 5.0],
        [4.0, 4.0, 4.0, 3.0],
        [5.5, 5.5, 2.5, 2.5]]

Here is my function:

import numpy as np

def avg_dups(genes, values):
    """Finds duplicate genes and averages their expression data."""
    unq_genes = np.unique(genes)
    out_values = np.zeros((unq_genes.shape[0], values.shape[1]))
    for i, gene in enumerate(unq_genes):
        dups = values[genes == gene]
        out_values[i] = np.mean(dups, axis=0)
    return unq_genes, out_values
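As written, the function assumes genes and values are NumPy arrays (the boolean mask genes == gene would not select rows from plain lists). A minimal driver for the example above, under that assumption (the explicit array conversion is my addition, not part of the original pipeline):

genes = np.array(['A', 'C', 'C', 'B', 'A'])
vals = np.array([[2.0, 2.0, 9.0, 9.0],
                 [3.0, 3.0, 3.0, 3.0],
                 [8.0, 8.0, 2.0, 2.0],
                 [4.0, 4.0, 4.0, 3.0],
                 [1.0, 1.0, 1.0, 1.0]])

unq_genes, out_vals = avg_dups(genes, vals)
# unq_genes -> ['A', 'B', 'C']
# out_vals  -> the averaged rows shown in the converted example above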

This function is slower than any other part of my data pipeline: it takes 5-10 seconds, while other steps that also operate on the whole dataset finish in under a second. Any thoughts on how I can improve this?

asked Feb 19, 2015 at 23:45
  • On average, about how many unique items can be present in genes? (Commented Feb 20, 2015 at 0:29)
  • @AshwiniChaudhary, ~20,000. (Commented Feb 20, 2015 at 0:34)
  • Can you give realistically sized and distributed example data? It helps a ton when trying to optimize. (Commented Feb 20, 2015 at 0:43)
  • @Veedrac, that's hard to answer because there is a lot of variance in the data. Here is an example dataset: ftp.ncbi.nlm.nih.gov/geo/series/GSE43nnn/GSE43805/soft. The leftmost column, ILMN..., needs to be converted to gene symbols, of which there are only 20,000[1], so I suspect there will be quite a few duplicates. [1] There are many more than 20,000 gene symbols, but I convert them to a subset for humans. (Commented Feb 20, 2015 at 1:03)
  • I was trying to indicate that the shape and cleanliness of the data can vary significantly. I have seen both 10KB and 20GB files, shapes ranging from (400000, 3) to (1000, 20), and duplicates ranging from 0 to >75%. (Commented Feb 20, 2015 at 1:33)

2 Answers


This seems to be the fastest so far:

import numpy as np

def avg_dups(genes, values):
    folded, indices, counts = np.unique(genes, return_inverse=True, return_counts=True)
    output = np.zeros((folded.shape[0], values.shape[1]))
    np.add.at(output, indices, values)
    output /= counts[:, np.newaxis]
    return folded, output

This finds the unique genes to fold the values into, along with the current index → new index mapping and the number of repeated values that map to the same index:

 folded, indices, counts = np.unique(genes, return_inverse=True, return_counts=True)
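On the question's example data this gives (a quick illustration, not part of the original answer):

genes = np.array(['A', 'C', 'C', 'B', 'A'])
folded, indices, counts = np.unique(genes, return_inverse=True, return_counts=True)
# folded  -> ['A', 'B', 'C']   (the sorted unique genes)
# indices -> [0, 2, 2, 1, 0]   (the output row each input row folds into)
# counts  -> [2, 1, 2]         (how many input rows fold into each output row)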

It adds the row from each current index to the new index in the new output:

 output = np.zeros((folded.shape[0], values.shape[1]))
 np.add.at(output, indices, values)

np.add.at(output, indices, values) is used instead of output[indices] += values because the buffering used by += means that rows with repeated indices would only be added once.
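A small demonstration of the difference, using the same inverse indices as above (my addition, not from the original answer):

out = np.zeros(3)
idx = np.array([0, 2, 2, 1, 0])

out[idx] += 1.0            # buffered: each target slot is written only once
# out is now [1., 1., 1.] rather than [2., 1., 2.]

out = np.zeros(3)
np.add.at(out, idx, 1.0)   # unbuffered: repeated indices accumulate
# out is now [2., 1., 2.]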

The mean is then taken by dividing each output row by the number of input rows that map to it:

 output /= counts[:, np.newaxis]

Using Ashwini Chaudhary's generate_test_data(2000) (giving a 10000x4 array), my rough timings are:

name             time/ms  author
avg_dups             230  gwg
avg_dups_fast         33  Ashwini Chaudhary
avg_dups_python       45  Ashwini Chaudhary
avg_dups             430  Veedrac
avg_dups               5  Veedrac, with Jaime's improvement
answered Feb 20, 2015 at 0:56
  • You can do folded, indices, counts = np.unique(genes, return_inverse=True, return_counts=True) and spare yourself all those broadcasting comparisons, which typically cripple performance. On the other hand, nice use of add.at! (Commented Feb 20, 2015 at 8:50)
  • @Jaime Blimey, that's fast indeed. +1 (Commented Feb 20, 2015 at 16:58)
  • @Veedrac, this is very fast. Thanks! Averaging duplicates in my test SOFT file (originally picked at random) using my original function took 5.22903513908 seconds. Your function took 0.000233888626099 seconds. (Commented Feb 24, 2015 at 15:54)

Your current approach is slow because it takes \$\mathcal{O}(n^2)\$ time: for each unique gene it scans the entire genes array to find the matching rows.

One not-entirely-pure NumPy approach that I can think of takes \$\mathcal{O}(n \log n)\$ time (explanation in the code comments):

from collections import defaultdict
from itertools import count
import numpy as np

def avg_dups_fast(genes, values):
    # Find the sorted indices of all genes so that we can group them together.
    sorted_indices = np.argsort(genes)
    # Now create two arrays using `sorted_indices` where similar genes and
    # the corresponding values are grouped together.
    sorted_genes = genes[sorted_indices]
    sorted_values = values[sorted_indices]
    # To find each individual group we need the indices where the gene value
    # changes. We can use `numpy.where` with `numpy.diff` for this, but
    # `numpy.diff` won't work with strings, so we first generate a unique
    # integer for each gene using collections.defaultdict with itertools.count.
    # The dict hands out a new integer the first time a string is encountered
    # and reuses the same integer for repeated strings.
    d = defaultdict(count(0).next)
    unique_ints = np.fromiter((d[x] for x in sorted_genes), dtype=int)
    # Now get the indices where the gene changes...
    split_at = np.where(np.diff(unique_ints) != 0)[0] + 1
    # ...and split `sorted_values` at those indices.
    split_items = np.array_split(sorted_values, split_at)
    return np.unique(sorted_genes), np.array([np.mean(arr, axis=0) for arr in split_items])

There is also a pure Python approach that takes only \$\mathcal{O}(n)\$ time. Here I've simply used a dictionary with the gene as the key and appended the corresponding rows to a list:

from collections import defaultdict
from itertools import izip
import numpy as np

def avg_dups_python(genes, values):
    d = defaultdict(list)
    for k, v in izip(genes, values):
        d[k].append(v)
    return list(d), [np.mean(val, axis=0) for val in d.itervalues()]
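Since a plain dict does not return the genes in sorted order, the rows may need re-ordering before comparing against the NumPy versions. A minimal sketch of one way to do that (my addition, not part of the original answer; genes and vals are the arrays from the question's example):

out_genes, out_vals = avg_dups_python(genes, vals)
order = np.argsort(out_genes)              # indices that sort the gene names
sorted_genes = np.array(out_genes)[order]  # gene names in alphabetical order
sorted_vals = np.array(out_vals)[order]    # averaged rows re-ordered to match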

Timing comparisons:

>>> from string import ascii_letters
>>> from itertools import islice, product
>>> def generate_test_data(n):
        # n unique three-letter gene names, each repeated 5 times, then shuffled
        genes = np.array([''.join(x) for x in islice(product(ascii_letters, repeat=3), n)]*5, dtype='S3')
        np.random.shuffle(genes)
        vals = np.array([[2.0, 2.0, 9.0, 9.0],
                         [3.0, 3.0, 3.0, 3.0],
                         [8.0, 8.0, 2.0, 2.0],
                         [4.0, 4.0, 4.0, 3.0],
                         [1.0, 1.0, 1.0, 1.0]]*n)
        return genes, vals
... 
>>> data = generate_test_data(20000)
>>> %timeit avg_dups(*data)
1 loops, best of 3: 18.4 s per loop
>>> %timeit avg_dups_fast(*data)
10 loops, best of 3: 166 ms per loop
>>> %timeit avg_dups_python(*data)
1 loops, best of 3: 253 ms per loop
>>> (avg_dups(*data)[1] == avg_dups_fast(*data)[1]).all()
True
answered Feb 20, 2015 at 0:57
  • "we need to generate some unique integers for genes" → If they need to be unique, you should avoid hash - it does not guarantee uniqueness. (Commented Feb 20, 2015 at 1:23)
  • @Veedrac Interesting! So, two different strings can hash to the same value in a given process (considering strings have their own __hash__ method)? (Python 2.7) A simple alternative would be to use d = defaultdict(count(0).next), which returns a unique value each time a new string is encountered. (Commented Feb 20, 2015 at 2:42)
  • Yes, the whole point of hashing a string is that the hash is short. Loss of information is inevitable, hence collisions. (Commented Feb 20, 2015 at 6:39)
  • @JanneKarila You're right, not sure what I was thinking. :/ (Commented Feb 20, 2015 at 11:38)
  • @JanneKarila: Hash collisions don't take place due to "loss of information", but due to the fact that we do not have a perfect hash function, i.e. a hash function that guarantees that no two inputs will ever be mapped/hashed to the same output. (Commented Oct 8, 2015 at 1:43)
