I have gene expression data that I represent as a list of genes and a list of lists of values. I average the expression data for any genes with the same name.
For example:
genes = ['A', 'C', 'C', 'B', 'A']
vals = [[2.0, 2.0, 9.0, 9.0],   # A: will be averaged with row=4
        [3.0, 3.0, 3.0, 3.0],   # C: will be averaged with row=2
        [8.0, 8.0, 2.0, 2.0],   # C: will be averaged with row=1
        [4.0, 4.0, 4.0, 3.0],   # B: is fine
        [1.0, 1.0, 1.0, 1.0]]   # A: will be averaged with row=0
is converted to
genes = ['A', 'B', 'C']
vals = [[1.5, 1.5, 5.0, 5.0],
        [4.0, 4.0, 4.0, 3.0],
        [5.5, 5.5, 2.5, 2.5]]
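As a quick sanity check (with NumPy imported as np, as in the function below), the first output row is just the element-wise mean of input rows 0 and 4:

np.mean([vals[0], vals[4]], axis=0)   # -> array([1.5, 1.5, 5. , 5. ])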
Here is my function:
import numpy as np

def avg_dups(genes, values):
    """Finds duplicate genes and averages their expression data.
    """
    unq_genes = np.unique(genes)
    out_values = np.zeros((unq_genes.shape[0], values.shape[1]))
    for i, gene in enumerate(unq_genes):
        dups = values[genes == gene]
        out_values[i] = np.mean(dups, axis=0)
    return (unq_genes, out_values)
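(genes and values are NumPy arrays by the time they reach this function; the genes == gene mask relies on that.) A minimal call with the example data above would look roughly like this:

unq_genes, out_values = avg_dups(np.asarray(genes), np.asarray(vals))
# unq_genes  -> array(['A', 'B', 'C'], ...)
# out_values -> the three averaged rows shown above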
This function is slower than any other part of my data pipeline, taking 5-10 seconds while other steps that operate on the same full dataset finish in under a second. Any thoughts on how I can improve this?
2 Answers
This seems to be the fastest so far:
import numpy
from numpy import newaxis

def avg_dups(genes, values):
    folded, indices, counts = numpy.unique(genes, return_inverse=True, return_counts=True)

    output = numpy.zeros((folded.shape[0], values.shape[1]))
    numpy.add.at(output, indices, values)
    output /= counts[:, newaxis]

    return folded, output
This finds the unique genes to fold the values into, along with the current-index → new-index mapping and the number of repeated values that map to each new index:

folded, indices, counts = numpy.unique(genes, return_inverse=True, return_counts=True)
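For example, with the toy genes list from the question (a small illustration of what the three return values look like, reusing the import above):

genes = numpy.array(['A', 'C', 'C', 'B', 'A'])
folded, indices, counts = numpy.unique(genes, return_inverse=True, return_counts=True)
# folded  -> ['A', 'B', 'C']
# indices -> [0, 2, 2, 1, 0]   (each original row's position in folded)
# counts  -> [2, 1, 2]         (how many rows fold into each unique gene)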
It adds the row from each current index to the new index in the new output:

output = numpy.zeros((folded.shape[0], values.shape[1]))
numpy.add.at(output, indices, values)
numpy.add.at(output, indices, values) is used over output[indices] += values because the buffering used in += breaks the code for repeated indices.
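A tiny demonstration of that pitfall (an illustration only, reusing the numpy import above): with in-place fancy indexing each repeated index receives only a single update, while numpy.add.at accumulates all of them:

buffered = numpy.zeros(3)
buffered[[0, 0, 1]] += 1                 # index 0 appears twice but is only incremented once
# buffered -> [1., 1., 0.]

unbuffered = numpy.zeros(3)
numpy.add.at(unbuffered, [0, 0, 1], 1)   # both updates to index 0 are applied
# unbuffered -> [2., 1., 0.]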
The mean is then taken by dividing each row by the number of repeated values that were folded into it:

output /= counts[:, newaxis]
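counts[:, newaxis] turns the counts into a column vector, so the division broadcasts across each row. With the toy data the accumulated sums and counts combine like this (again just an illustration):

counts = numpy.array([2, 1, 2])
sums = numpy.array([[ 3.0,  3.0, 10.0, 10.0],   # the two 'A' rows summed
                    [ 4.0,  4.0,  4.0,  3.0],   # the single 'B' row
                    [11.0, 11.0,  5.0,  5.0]])  # the two 'C' rows summed
sums / counts[:, newaxis]
# -> [[1.5, 1.5, 5.0, 5.0],
#     [4.0, 4.0, 4.0, 3.0],
#     [5.5, 5.5, 2.5, 2.5]]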
Using Ashwini Chaudhary's generate_test_data(2000) (giving a 10000x4 array), my rough timings are:

name               time/ms   Author
avg_dups               230   gwg
avg_dups_fast           33   Ashwini Chaudhary
avg_dups_python         45   Ashwini Chaudhary
avg_dups               430   Veedrac
avg_dups                 5   Veedrac with Jaime's improvement
- You can do folded, indices, counts = np.unique(genes, return_inverse=True, return_counts=True) and spare yourself all those broadcasting comparisons, which typically cripple performance. On the other hand, nice use of add.at! – Jaime, Feb 20, 2015 at 8:50
- @Jaime Blimey, that's fast indeed. +1 – Veedrac, Feb 20, 2015 at 16:58
- @Veedrac, this is very fast. Thanks! Averaging duplicates in my test SOFT file (originally picked at random) using my original function took 5.22903513908 seconds. Your function took 0.000233888626099 seconds. – jds, Feb 24, 2015 at 15:54
Your current approach is slow because it takes \$\mathcal{O}(n^2)\$ time: for each unique gene you scan the entire genes array to find its rows.

One not-pure-NumPy approach that I can think of will take \$\mathcal{O}(n \log n)\$ time for this (explanation in comments).
from collections import defaultdict
from itertools import count
import numpy as np

def avg_dups_fast(genes, values):
    # Find the sorted indices of all genes so that we can group them together
    sorted_indices = np.argsort(genes)
    # Now create two arrays using `sorted_indices` where similar genes and
    # the corresponding values are grouped together
    sorted_genes = genes[sorted_indices]
    sorted_values = values[sorted_indices]
    # To find each individual group we need the indices where the gene value
    # changes. We can use `numpy.where` with `numpy.diff` for this. But as
    # `numpy.diff` won't work with strings, we first generate unique integers
    # for the genes using `collections.defaultdict` with `itertools.count`.
    # This dict hands out a new integer as soon as a new string is
    # encountered and remembers it, so the same integer is reused for
    # repeated strings.
    d = defaultdict(count(0).next)
    unique_ints = np.fromiter((d[x] for x in sorted_genes), dtype=int)
    # Now get the indices where the gene changes
    split_at = np.where(np.diff(unique_ints) != 0)[0] + 1
    # and split `sorted_values` at those indices.
    split_items = np.array_split(sorted_values, split_at)
    return np.unique(sorted_genes), np.array([np.mean(arr, axis=0) for arr in split_items])
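To see the grouping on the question's toy data, here is a quick trace of the intermediate values (an illustration only, not part of the function above):

genes = np.array(['A', 'C', 'C', 'B', 'A'])
sorted_indices = np.argsort(genes)    # [0, 4, 3, 1, 2] (tie order may vary)
sorted_genes = genes[sorted_indices]  # ['A', 'A', 'B', 'C', 'C']
# unique_ints becomes [0, 0, 1, 2, 2], so np.diff(unique_ints) != 0 at
# positions 1 and 2 and split_at == [2, 3]: sorted_values is split into
# the two 'A' rows, the single 'B' row and the two 'C' rows before averaging.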
There is also a pure-Python approach that takes only \$\mathcal{O}(n)\$ time. Here I've simply used a dictionary with the gene as key, and the corresponding value rows are appended to a list:
from collections import defaultdict
from itertools import izip

def avg_dups_python(genes, values):
    d = defaultdict(list)
    for k, v in izip(genes, values):
        d[k].append(v)
    return list(d), [np.mean(val, axis=0) for val in d.itervalues()]
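A minimal usage sketch with the question's toy data (Python 2, matching the izip/itervalues style above). The genes come back in dictionary order rather than sorted, but each gene is still paired with its own averaged row:

genes = np.array(['A', 'C', 'C', 'B', 'A'])
vals = np.array([[2.0, 2.0, 9.0, 9.0],
                 [3.0, 3.0, 3.0, 3.0],
                 [8.0, 8.0, 2.0, 2.0],
                 [4.0, 4.0, 4.0, 3.0],
                 [1.0, 1.0, 1.0, 1.0]])
out_genes, out_means = avg_dups_python(genes, vals)
averaged = dict(zip(out_genes, out_means))
# averaged['A'] -> [1.5, 1.5, 5.0, 5.0]
# averaged['B'] -> [4.0, 4.0, 4.0, 3.0]
# averaged['C'] -> [5.5, 5.5, 2.5, 2.5]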
Timing comparisons:
>>> from string import ascii_letters
>>> from itertools import islice, product
>>> def generate_test_data(n):
...     genes = np.array([''.join(x) for x in islice(product(ascii_letters, repeat=3), n)]*5, dtype='S3')
...     np.random.shuffle(genes)
...     vals = np.array([[2.0, 2.0, 9.0, 9.0],
...                      [3.0, 3.0, 3.0, 3.0],
...                      [8.0, 8.0, 2.0, 2.0],
...                      [4.0, 4.0, 4.0, 3.0],
...                      [1.0, 1.0, 1.0, 1.0]]*n)
...     return genes, vals
...
>>> data = generate_test_data(20000)
>>> %timeit avg_dups(*data)
1 loops, best of 3: 18.4 s per loop
>>> %timeit avg_dups_fast(*data)
10 loops, best of 3: 166 ms per loop
>>> %timeit avg_dups_python(*data)
1 loops, best of 3: 253 ms per loop
>>> (avg_dups(*data)[1] == avg_dups_fast(*data)[1]).all()
True
- "we need to generate some unique integers for genes" → If they need to be unique, you should avoid hash - it does not guarantee uniqueness. – Veedrac, Feb 20, 2015 at 1:23
- @Veedrac Interesting! So two different strings can hash to the same value in a given process (considering strings have their own __hash__ method)? (Python 2.7) A simple alternative will be to use d = defaultdict(count(0).next), which returns a unique value each time a new string is encountered. – Ashwini Chaudhary, Feb 20, 2015 at 2:42
- Yes, the whole point of hashing a string is that the hash is short. Loss of information is inevitable, hence collisions. – Janne Karila, Feb 20, 2015 at 6:39
- @JanneKarila You're right, not sure what I was thinking. :/ – Ashwini Chaudhary, Feb 20, 2015 at 11:38
- @JanneKarila: Hash collisions don't take place due to "loss of information", but due to the fact that we do not have a perfect hash function --i.e. a hash function that guarantees that no two inputs will ever be mapped/hashed to the same output. – code_dredd, Oct 8, 2015 at 1:43
- …ILMN…, needs to be converted to gene symbols, of which there are only 20,000 [1], so I suspect there will be quite a few duplicates. [1] There are many more than 20,000 gene symbols, but I convert them to a subset for humans.