I have a very large dictionary that contains key-value pairs where the key is a string and the value is a list. All lists have the same length. I want to split the dataset (dictionary) into two based on the values in one of the lists. One of the lists contains binary values (i.e. 0 or 1). I want to split the dictionary so that one of the resulting dictionaries contains all rows where the binary value is 0 and the other those where it is 1.
(It is easier to understand this if you mentally transform the dataset into a table where each index of the lists is a row, and the dictionary keys are column names.)
I guess I could first convert the dict to a Pandas dataframe, subset the dataframe, and convert the resulting dataframes back to dicts, but I only want to do this if it proves to be faster than doing it with dictionaries only.

The following code works, but as said I'm not sure about its efficiency. I'm only interested in Python >= 3.6.
```python
from collections import defaultdict

def split_d(d, key):
    include = defaultdict(list)
    exclude = defaultdict(list)
    for i in range(0, len(d[key])):
        binary = d[key][i]
        for k, v in d.items():
            if k == key:
                continue
            if binary == 0:
                exclude[k].append(v[i])
            elif binary == 1:
                include[k].append(v[i])
            else:
                raise ValueError(f"Key {key} is not binary. Expected values 0 or 1")
    return include, exclude

d = {
    'prof': [1, 0, 0, 1, 1, 1, 1, 1, 0, 1],
    'val': [45, 12, 36, 48, 48, 59, 5, 4, 32, 7]
}

print(split_d(d, 'prof'))
```
3 Answers
You should indent with 4 spaces, as per PEP 8.

You can move your `if` out of the inner loop and use `enumerate`, leading to a ~9% speed up.
```python
def split_d(d, key):
    include = defaultdict(list)
    exclude = defaultdict(list)
    for i, binary in enumerate(d[key]):
        if binary == 1:
            output = include
        elif binary == 0:
            output = exclude
        else:
            raise ValueError(f"Key {key} is not binary. Expected values 0 or 1")
        for k, v in d.items():
            if k == key:
                continue
            output[k].append(v[i])
    return include, exclude
```
You can also remove `key` from `d` and add it back when you've gone through both loops, leading to a further ~5% speed up.
```python
def split_d(d, key):
    include = defaultdict(list)
    exclude = defaultdict(list)
    keys = d.pop(key)
    try:
        for i, binary in enumerate(keys):
            if binary == 1:
                output = include
            elif not binary:
                output = exclude
            else:
                raise ValueError(f"Key {key} is not binary. Expected values 0 or 1")
            for k, v in d.items():
                output[k].append(v[i])
    finally:
        d[key] = keys
    return include, exclude
```
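A quick sanity check on the question's sample data (a sketch; the function is repeated so the snippet runs standalone) confirms this version returns the same split as the original and, thanks to the `try`/`finally`, leaves the caller's dict intact:

```python
from collections import defaultdict

def split_d(d, key):
    include = defaultdict(list)
    exclude = defaultdict(list)
    keys = d.pop(key)
    try:
        for i, binary in enumerate(keys):
            if binary == 1:
                output = include
            elif not binary:
                output = exclude
            else:
                raise ValueError(f"Key {key} is not binary. Expected values 0 or 1")
            for k, v in d.items():
                output[k].append(v[i])
    finally:
        d[key] = keys  # restore the popped column even if we raised
    return include, exclude

d = {'prof': [1, 0, 0, 1, 1, 1, 1, 1, 0, 1],
     'val': [45, 12, 36, 48, 48, 59, 5, 4, 32, 7]}
include, exclude = split_d(d, 'prof')

assert dict(include) == {'val': [45, 48, 48, 59, 5, 4, 7]}
assert dict(exclude) == {'val': [12, 36, 32]}
assert 'prof' in d  # the caller's dict is left intact
```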
Other than the above, your code looks good and I don't think there is any way to increase the performance in vanilla Python. If this is not fast enough, then looking into NumPy would be the best thing next.
- "Thanks for the reply. The indents are a mistake that I overlooked; repl.it defaults to 2-space indents unfortunately. In my editor for production I do follow PEP 8. I like `enumerate`, but I find it especially clever to move the `key` out of the dictionary, doing away with the `if k == key` check. Nice! Thanks again." – Bram Vanroy, Apr 20, 2018 at 10:45
- "Out of curiosity, is there an advantage of using `not binary` over `binary == 0`?" – Bram Vanroy, Apr 20, 2018 at 10:52
- "@BramVanroy I tried a timing with and without, and it slightly improved performance. IIRC keywords are the fastest, then native operators, then functions. And so when I was micro-optimizing, I added that too. I remember hearing `not not obj` is faster than `bool(obj)`, for example." – Apr 20, 2018 at 10:54
- "I see, thanks. And why is it necessary to add the binary column back? Is it because when passing mutable items to a function, they act as if they are passed by reference, thus modifying the original argument?" – Bram Vanroy, Apr 20, 2018 at 10:58
- "@BramVanroy Something like that. If you don't, it'll have a potentially unwanted side effect." – Apr 20, 2018 at 11:00
Style comments

Python has an official style guide, PEP 8. It recommends consistently using 4 spaces as indentation.

You should also add a docstring to your function describing what it does.
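A docstring along these lines would do (a sketch only; adapt the wording to taste):

```python
def split_d(d, key):
    """Split `d` into two dicts of lists based on the binary list `d[key]`.

    Rows where `d[key]` is 1 go into the first returned dict, rows where it
    is 0 into the second; any other value raises ValueError.
    """
```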
Algorithm comments

In the case where the dictionary has only two keys, one on which to split and one that contains the values, you can easily use `itertools.groupby` to achieve a similar effect:
```python
import itertools

def split_d(d, key, val):
    """Split the dictionary `d` into two, based on the binary classifier `d[key]`."""
    assert set(d[key]) == {0, 1}
    sorted_tuples = sorted(zip(d[key], d[val]), key=lambda x: x[0])
    grouped = itertools.groupby(sorted_tuples, lambda x: x[0])
    return [{val: [x[1] for x in g]} for _, g in grouped]
```
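On the question's sample data this returns the zeros group first, because `groupby` emits groups in sorted-key order (the function is repeated here so the snippet is self-contained):

```python
import itertools

def split_d(d, key, val):
    """Split the dictionary `d` into two, based on the binary classifier `d[key]`."""
    assert set(d[key]) == {0, 1}
    # Stable sort on the classifier keeps each group's rows in original order.
    sorted_tuples = sorted(zip(d[key], d[val]), key=lambda x: x[0])
    grouped = itertools.groupby(sorted_tuples, lambda x: x[0])
    return [{val: [x[1] for x in g]} for _, g in grouped]

d = {'prof': [1, 0, 0, 1, 1, 1, 1, 1, 0, 1],
     'val': [45, 12, 36, 48, 48, 59, 5, 4, 32, 7]}
exclude, include = split_d(d, 'prof', 'val')
print(exclude)  # {'val': [12, 36, 32]}
print(include)  # {'val': [45, 48, 48, 59, 5, 4, 7]}
```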
This performs similarly (actually, slightly faster) for the given dictionary:

```
In [52]: %timeit split_d_g(d, 'prof', 'val')
5.63 μs ± 68.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [53]: %timeit split_d_op(d, 'prof')
6.82 μs ± 597 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
However, it is a lot more concise and, IMO, more readable.
For larger inputs it stays this way (even though my algorithm needs to sort the input for `itertools.groupby`), but the improvements recommended in the answer by @Peilonrayz beat us both:

[Benchmark plot omitted: runtime vs. input size, both axes in log base 10.]
- "Thanks for the reply. The indents are a mistake that I overlooked; repl.it defaults to 2-space indents unfortunately. In my editor for production I do follow PEP 8. I like the use of `groupby` but I don't find your code to be more readable. Perhaps because I don't have a lot of experience, but I still find full-fledged loops easier to read than list comprehensions. Finally, could you comment on the use of `assert`? From the documentation and this answer I assumed it's used for debugging only?" – Bram Vanroy, Apr 20, 2018 at 10:44
- "I also very much appreciate the benchmark, very insightful!" – Bram Vanroy, Apr 20, 2018 at 10:45
- "@BramVanroy Agreed on the getting-used-to part. It helps if you separate out some steps and give them names, as I did here, but a one-line solution using `itertools` is usually borderline unreadable. The separation also helps with keeping under 80 characters per line." – Graipher, Apr 20, 2018 at 11:54
- "@BramVanroy Regarding the `assert`, you are right, it is more for debugging (you can turn off the checking of `assert` statements). I just put it there instead of manually raising the exception because it was easier to write :)" – Graipher, Apr 20, 2018 at 11:55
- "Could you share the benchmarking code you used? It would be interesting to see how Eric's answer compares to the previous examples!" – Bram Vanroy, Apr 20, 2018 at 13:16
As you mentioned, Pandas or at least NumPy would do just fine. They're fast and the syntax is clean and straightforward for this example.
With NumPy
You just need to define a mask as a boolean array:

```python
import numpy as np

# Note: dtype=bool (the builtin); the np.bool alias has since been removed from NumPy.
mask = np.array([1, 0, 0, 1, 1, 1, 1, 1, 0, 1], dtype=bool)
```
And apply the mask or its inverse:

```python
val = np.array([45, 12, 36, 48, 48, 59, 5, 4, 32, 7])
val[mask]
# array([45, 48, 48, 59, 5, 4, 7])
val[~mask]
# array([12, 36, 32])
```
`mask` really needs to be a boolean array; you'd get an incorrect result otherwise:
```python
val = np.array([45, 12, 36, 48, 48, 59, 5, 4, 32, 7])
mask = np.array([1, 0, 0, 1, 1, 1, 1, 1, 0, 1])
val[mask]
# array([12, 45, 45, 12, 12, 12, 12, 12, 45, 12])
```
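That surprising result is integer ("fancy") indexing: each 1 and 0 is treated as a position to look up, not a keep/drop flag. If an integer list is what you have, converting it with `astype(bool)` (or passing `dtype=bool` up front) restores the intended masking behaviour:

```python
import numpy as np

val = np.array([45, 12, 36, 48, 48, 59, 5, 4, 32, 7])
mask = np.array([1, 0, 0, 1, 1, 1, 1, 1, 0, 1])

# Integer arrays select by index; booleans select by flag.
print(val[mask.astype(bool)])   # [45 48 48 59  5  4  7]
print(val[~mask.astype(bool)])  # [12 36 32]
```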
With Pandas

You're working with dicts of arrays? That's basically what `pandas.DataFrame`s are for!
```python
import pandas as pd
import numpy as np

d = {
    'prof': [1, 0, 0, 1, 1, 1, 1, 1, 0, 1],
    'val': [45, 12, 36, 48, 48, 59, 5, 4, 32, 7],
    'test': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
}
key = 'prof'
```
Define your mask first, as with `numpy`:

```python
mask = np.array(d.pop(key), dtype=bool)  # builtin bool; the np.bool alias was removed from NumPy
```
Define your dataframe:

```python
df = pd.DataFrame(d)
```
Mask it and export it as a dict of lists:

```python
df[mask].to_dict('list')
# {'test': [1, 4, 5, 6, 7, 8, 10], 'val': [45, 48, 48, 59, 5, 4, 7]}
df[~mask].to_dict('list')
# {'test': [2, 3, 9], 'val': [12, 36, 32]}
```
Done! The huge advantage is that anyone with some experience of numpy or pandas will understand the code right away.
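As a variation (a sketch, not from the original answer): `DataFrame.groupby` on the binary column produces both halves in one pass, without building the mask by hand:

```python
import pandas as pd

d = {'prof': [1, 0, 0, 1, 1, 1, 1, 1, 0, 1],
     'val': [45, 12, 36, 48, 48, 59, 5, 4, 32, 7],
     'test': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}

df = pd.DataFrame(d)
# Group on the classifier column; each group becomes a dict of lists.
parts = {flag: g.drop(columns='prof').to_dict('list')
         for flag, g in df.groupby('prof')}
print(parts[0])  # {'val': [12, 36, 32], 'test': [2, 3, 9]}
print(parts[1])  # {'val': [45, 48, 48, 59, 5, 4, 7], 'test': [1, 4, 5, 6, 7, 8, 10]}
```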
- "Hi Eric, thanks for the reply. I was already thinking of numpy because it's so darn fast, but I'm not experienced with the module. I would be interested to see benchmark results comparing your solution to the others." – Bram Vanroy, Apr 20, 2018 at 12:54
- "@BramVanroy: You're welcome. I've added a pandas example, which is even easier with your data format. I don't have time for benchmarking, but I dare say that it should be faster than a pure Python solution, especially with large arrays." – Eric Duminil, Apr 20, 2018 at 13:06