I have a very large dictionary that contains key-value pairs where the key is a string and the value is a list. All lists have the same length. I want to split the dataset (dictionary) into two based on the values in one of the lists. One of the lists contains binary values (i.e. 0 or 1). I want to split the dictionary so that one of the resulting dictionaries contains all rows where the binary value is 0 and the other those where it is 1.
(It is easier to understand this if you mentally transform the dataset into a table where each index of the lists is a row, and the dictionary keys are column names.)
I guess I could first convert the dict to a Pandas dataframe, subset the dataframe, and convert the resulting dataframes back to dicts, but I only want to do this if it proves to be faster than doing it with dictionaries only.

The following code works, but as said I'm not sure about its efficiency. I'm only interested in Python >= 3.6.
```python
from collections import defaultdict

def split_d(d, key):
    include = defaultdict(list)
    exclude = defaultdict(list)
    for i in range(0, len(d[key])):
        binary = d[key][i]
        for k, v in d.items():
            if k == key:
                continue
            if binary == 0:
                exclude[k].append(v[i])
            elif binary == 1:
                include[k].append(v[i])
            else:
                raise ValueError(f"Key {key} is not binary. Expected values 0 or 1")
    return include, exclude

d = {
    'prof': [1, 0, 0, 1, 1, 1, 1, 1, 0, 1],
    'val': [45, 12, 36, 48, 48, 59, 5, 4, 32, 7]
}

print(split_d(d, 'prof'))
```
3 Answers
You should indent with 4 spaces, as per PEP 8.

You can move your `if` out of the inner loop and use `enumerate`, leading to a ~9% speed up.
```python
def split_d(d, key):
    include = defaultdict(list)
    exclude = defaultdict(list)
    for i, binary in enumerate(d[key]):
        if binary == 1:
            output = include
        elif binary == 0:
            output = exclude
        else:
            raise ValueError(f"Key {key} is not binary. Expected values 0 or 1")
        for k, v in d.items():
            if k == key:
                continue
            output[k].append(v[i])
    return include, exclude
```
You can also remove `key` from `d` and add it back when you've gone through both loops, leading to a further ~5% speed up.
```python
def split_d(d, key):
    include = defaultdict(list)
    exclude = defaultdict(list)
    keys = d.pop(key)
    try:
        for i, binary in enumerate(keys):
            if binary == 1:
                output = include
            elif not binary:
                output = exclude
            else:
                raise ValueError(f"Key {key} is not binary. Expected values 0 or 1")
            for k, v in d.items():
                output[k].append(v[i])
    finally:
        d[key] = keys
    return include, exclude
```
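A quick sanity check on the question's sample data (a sketch; the function is repeated so the snippet runs standalone) confirms this version returns the same split as the original and, thanks to the `try`/`finally`, leaves the caller's dict intact:

```python
from collections import defaultdict

def split_d(d, key):
    include = defaultdict(list)
    exclude = defaultdict(list)
    keys = d.pop(key)
    try:
        for i, binary in enumerate(keys):
            if binary == 1:
                output = include
            elif not binary:
                output = exclude
            else:
                raise ValueError(f"Key {key} is not binary. Expected values 0 or 1")
            for k, v in d.items():
                output[k].append(v[i])
    finally:
        d[key] = keys  # restore the popped column even if we raised
    return include, exclude

d = {'prof': [1, 0, 0, 1, 1, 1, 1, 1, 0, 1],
     'val': [45, 12, 36, 48, 48, 59, 5, 4, 32, 7]}
include, exclude = split_d(d, 'prof')

assert dict(include) == {'val': [45, 48, 48, 59, 5, 4, 7]}
assert dict(exclude) == {'val': [12, 36, 32]}
assert 'prof' in d  # the caller's dict is left intact
```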
Other than the above, your code looks good and I don't think there is any way to increase the performance in vanilla Python. If this is not fast enough, then looking into NumPy would be the best thing next.
- "Thanks for the reply. The indents are a mistake that I overlooked; repl.it defaults to 2-space indents unfortunately. In my editor for production I do follow PEP 8. I like `enumerate`, but I find it especially clever to move the `key` out of the dictionary, doing away with the `if k == key` check. Nice! Thanks again." – Bram Vanroy, Apr 20, 2018 at 10:45
- "Out of curiosity, is there an advantage of using `not binary` over `binary == 0`?" – Bram Vanroy, Apr 20, 2018 at 10:52
- "@BramVanroy I tried a timing with and without, and it slightly improved performance. IIRC keywords are the fastest, then native operators, then functions. And so when I was micro-optimizing, I added that too. I remember hearing `not not obj` is faster than `bool(obj)`, for example." – Apr 20, 2018 at 10:54
- "I see, thanks. And why is it necessary to add the binary column back? Is it because when passing mutable items to a function, they act as if they are passed by reference, thus modifying the original argument?" – Bram Vanroy, Apr 20, 2018 at 10:58
- "@BramVanroy Something like that. If you don't, it'll have a potentially unwanted side effect." – Apr 20, 2018 at 11:00
Style comments

Python has an official style guide, PEP 8. It recommends consistently using 4 spaces as indentation.

You should also add a docstring to your function describing what it does.
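A docstring along these lines would do (a sketch only; adapt the wording to taste):

```python
def split_d(d, key):
    """Split `d` into two dicts of lists based on the binary list `d[key]`.

    Rows where `d[key]` is 1 go into the first returned dict, rows where it
    is 0 into the second; any other value raises ValueError.
    """
```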
Algorithm comments

In the case where the dictionary has only two keys, one on which to split and one that contains the values, you can easily use `itertools.groupby` to achieve a similar effect:
```python
import itertools

def split_d(d, key, val):
    """Split the dictionary `d` into two, based on the binary classifier `d[key]`."""
    assert set(d[key]) == {0, 1}
    sorted_tuples = sorted(zip(d[key], d[val]), key=lambda x: x[0])
    grouped = itertools.groupby(sorted_tuples, lambda x: x[0])
    return [{val: [x[1] for x in g]} for _, g in grouped]
```
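On the question's sample data this returns the zeros group first, because `groupby` emits groups in sorted-key order (the function is repeated here so the snippet is self-contained):

```python
import itertools

def split_d(d, key, val):
    """Split the dictionary `d` into two, based on the binary classifier `d[key]`."""
    assert set(d[key]) == {0, 1}
    # Stable sort on the classifier keeps each group's rows in original order.
    sorted_tuples = sorted(zip(d[key], d[val]), key=lambda x: x[0])
    grouped = itertools.groupby(sorted_tuples, lambda x: x[0])
    return [{val: [x[1] for x in g]} for _, g in grouped]

d = {'prof': [1, 0, 0, 1, 1, 1, 1, 1, 0, 1],
     'val': [45, 12, 36, 48, 48, 59, 5, 4, 32, 7]}
exclude, include = split_d(d, 'prof', 'val')
print(exclude)  # {'val': [12, 36, 32]}
print(include)  # {'val': [45, 48, 48, 59, 5, 4, 7]}
```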
This performs similarly (actually, slightly faster) for the given dictionary:

```
In [52]: %timeit split_d_g(d, 'prof', 'val')
5.63 μs ± 68.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [53]: %timeit split_d_op(d, 'prof')
6.82 μs ± 597 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
However, it is a lot more concise and, IMO, more readable.
For larger inputs it stays this way (even though my algorithm needs to sort the input for `itertools.groupby`), but the improvements recommended in the answer by @Peilonrayz beat us both:

[Benchmark plot omitted: runtime vs. input size, both axes in log base 10.]
- "Thanks for the reply. The indents are a mistake that I overlooked; repl.it defaults to 2-space indents unfortunately. In my editor for production I do follow PEP 8. I like the use of `groupby` but I don't find your code to be more readable. Perhaps because I don't have a lot of experience, but I still find full-fledged loops easier to read than list comprehensions. Finally, could you comment on the use of `assert`? From the documentation and this answer I assumed it's used for debugging only?" – Bram Vanroy, Apr 20, 2018 at 10:44
- "I also very much appreciate the benchmark, very insightful!" – Bram Vanroy, Apr 20, 2018 at 10:45
- "@BramVanroy Agreed on the getting-used-to part. It helps if you separate out some steps and give them names, as I did here, but a one-line solution using `itertools` is usually borderline unreadable. The separation also helps with keeping under 80 characters per line." – Graipher, Apr 20, 2018 at 11:54
- "@BramVanroy Regarding the `assert`, you are right, it is more for debugging (you can turn off the checking of `assert` statements). I just put it there instead of manually raising the exception because it was easier to write :)" – Graipher, Apr 20, 2018 at 11:55
- "Could you share the benchmarking code you used? It would be interesting to see how Eric's answer compares to the previous examples!" – Bram Vanroy, Apr 20, 2018 at 13:16
As you mentioned, Pandas or at least NumPy would do just fine. They're fast and the syntax is clean and straightforward for this example.
With NumPy
You just need to define a mask as a boolean array:

```python
import numpy as np

# Note: dtype=bool (the builtin); the np.bool alias has since been removed from NumPy.
mask = np.array([1, 0, 0, 1, 1, 1, 1, 1, 0, 1], dtype=bool)
```
And apply the mask or its inverse:

```python
val = np.array([45, 12, 36, 48, 48, 59, 5, 4, 32, 7])
val[mask]
# array([45, 48, 48, 59, 5, 4, 7])
val[~mask]
# array([12, 36, 32])
```
`mask` really needs to be a boolean array; you'd get an incorrect result otherwise:
```python
val = np.array([45, 12, 36, 48, 48, 59, 5, 4, 32, 7])
mask = np.array([1, 0, 0, 1, 1, 1, 1, 1, 0, 1])
val[mask]
# array([12, 45, 45, 12, 12, 12, 12, 12, 45, 12])
```
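That surprising result is integer ("fancy") indexing: each 1 and 0 is treated as a position to look up, not a keep/drop flag. If an integer list is what you have, converting it with `astype(bool)` (or passing `dtype=bool` up front) restores the intended masking behaviour:

```python
import numpy as np

val = np.array([45, 12, 36, 48, 48, 59, 5, 4, 32, 7])
mask = np.array([1, 0, 0, 1, 1, 1, 1, 1, 0, 1])

# Integer arrays select by index; booleans select by flag.
print(val[mask.astype(bool)])   # [45 48 48 59  5  4  7]
print(val[~mask.astype(bool)])  # [12 36 32]
```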
With Pandas

You're working with dicts of arrays? That's basically what `pandas.DataFrame`s are for!
```python
import pandas as pd
import numpy as np

d = {
    'prof': [1, 0, 0, 1, 1, 1, 1, 1, 0, 1],
    'val': [45, 12, 36, 48, 48, 59, 5, 4, 32, 7],
    'test': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
}
key = 'prof'
```
Define your mask first, as with `numpy`:

```python
mask = np.array(d.pop(key), dtype=bool)  # builtin bool; the np.bool alias was removed from NumPy
```
Define your dataframe:

```python
df = pd.DataFrame(d)
```
Mask it and export it as a dict of lists:

```python
df[mask].to_dict('list')
# {'test': [1, 4, 5, 6, 7, 8, 10], 'val': [45, 48, 48, 59, 5, 4, 7]}
df[~mask].to_dict('list')
# {'test': [2, 3, 9], 'val': [12, 36, 32]}
```
Done! The huge advantage is that anyone with some experience of numpy or pandas will understand the code right away.
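As a variation (a sketch, not from the original answer): `DataFrame.groupby` on the binary column produces both halves in one pass, without building the mask by hand:

```python
import pandas as pd

d = {'prof': [1, 0, 0, 1, 1, 1, 1, 1, 0, 1],
     'val': [45, 12, 36, 48, 48, 59, 5, 4, 32, 7],
     'test': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}

df = pd.DataFrame(d)
# Group on the classifier column; each group becomes a dict of lists.
parts = {flag: g.drop(columns='prof').to_dict('list')
         for flag, g in df.groupby('prof')}
print(parts[0])  # {'val': [12, 36, 32], 'test': [2, 3, 9]}
print(parts[1])  # {'val': [45, 48, 48, 59, 5, 4, 7], 'test': [1, 4, 5, 6, 7, 8, 10]}
```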
- "Hi Eric, thanks for the reply. I was already thinking of numpy because it's so darn fast, but I'm not experienced with the module. I would be interested to see benchmark results comparing your solution to the others." – Bram Vanroy, Apr 20, 2018 at 12:54
- "@BramVanroy: You're welcome. I've added a pandas example, which is even easier with your data format. I don't have time for benchmarking, but I dare say that it should be faster than a pure Python solution, especially with large arrays." – Eric Duminil, Apr 20, 2018 at 13:06