I have several functions for merging dictionaries, but over time I created a more general function that would make all the others obsolete if it weren't slower.
The specialized functions (there are several like this one) look like this:
def merge_keep_lowest(dict1, dict2, *dicts):
    """
    Merge an arbitrary number of :py:class:`dict`-like objects, keeping the
    **lowest** encountered value for each key.

    Parameters
    ----------
    *dicts : dict-like
        The dictionaries to be merged. At least two must be given.

    Returns
    -------
    result : same type as ``dict1``
        The merged dictionary.
    """
    # Copy the first dict so the result's class and its properties are defined.
    result = dict1.copy()
    # We only want to iterate once, so combine the second and the other dicts.
    dicts = (dict2,) + dicts
    # Now iterate over each dictionary, since there is no directly usable
    # dict method for this kind of operation.
    for d in dicts:
        # Iterate over each key of this dict.
        # This is faster than "for kw in d.keys()".
        for kw in d:
            # One could also use "try ... except KeyError ..." here instead of
            # the "if kw in result". That would be a bit faster if all dicts
            # contained mostly the same keys, but containment checks on
            # dictionaries are relatively cheap, so it doesn't make a huge
            # difference.
            if kw in result:
                # The key is already present in the result, so compare the
                # values and replace if the new one is smaller.
                if d[kw] < result[kw]:
                    result[kw] = d[kw]
            else:
                # If the key is not yet in the result dict, just initialize it.
                result[kw] = d[kw]
    return result
There are some more for replacing the value if it is higher/shorter/longer/... They all look exactly the same except for the if d[kw] < result[kw]: line, as the sketch below shows.
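For illustration, one of those siblings might look like this (merge_keep_highest is a hypothetical name, and the docstring is omitted; only the comparison differs):

def merge_keep_highest(dict1, dict2, *dicts):
    result = dict1.copy()
    dicts = (dict2,) + dicts
    for d in dicts:
        for kw in d:
            if kw in result:
                # The only changed line: keep the highest value instead.
                if d[kw] > result[kw]:
                    result[kw] = d[kw]
            else:
                result[kw] = d[kw]
    return result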
Then I thought: since < is also defined as a method on many data types, I could generalize and allow arbitrary method calls of that nature:
def merge_keep_value_by_method(dict1, dict2, *dicts, method=None):
    """
    Merge an arbitrary number of :py:class:`dict`-like objects, replacing the
    temporary value for each key if it satisfies
    ``new_value.method(tmp_value)``.

    Parameters
    ----------
    *dicts : dict-like
        The dictionaries to be merged. At least two must be given.
    method : str
        If a key encountered during the merge is already present in the
        temporary result, its value is replaced by the new value if it
        satisfies the condition ``new_value.method(tmp_value)``.
        Default is ``None``, but it **must** be overridden, otherwise an
        error is raised.

    Returns
    -------
    result : same type as ``dict1``
        The merged dictionary.
    """
    result = dict1.copy()
    dicts = (dict2,) + dicts
    for d in dicts:
        for kw in d:
            if kw in result:
                # Here is the difference!
                if getattr(d[kw], method)(result[kw]):
                    result[kw] = d[kw]
            else:
                result[kw] = d[kw]
    return result
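The method does not have to be a comparison dunder; any method that takes the stored value and returns something truthy will do. A hypothetical example with set values (s1 and s2 are made up for illustration):

s1 = {'x': {1, 2, 3}}
s2 = {'x': {1, 2}}
# Replace the stored set whenever the new one is a subset of it,
# i.e. keep the smallest set encountered per key.
merge_keep_value_by_method(s1, s2, method='issubset')  # {'x': {1, 2}}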
But this could be generalized even more by allowing arbitrary functions to be executed. This has the downside that one must wrap methods in a function if they are only available as methods, but it allows much more freedom:
def merge_keep_value_by_func(dict1, dict2, *dicts, func=None):
    """
    Merge an arbitrary number of :py:class:`dict`-like objects, replacing the
    temporary value for each key if it satisfies
    ``func(new_value, tmp_value)``.

    Parameters
    ----------
    *dicts : dict-like
        The dictionaries to be merged. At least two must be given.
    func : callable
        If a key encountered during the merge is already present in the
        temporary result, its value is replaced by the new value if it
        satisfies the condition ``func(new_value, tmp_value)``.
        Default is ``None``, but it **must** be overridden, otherwise an
        error is raised.

    Returns
    -------
    result : same type as ``dict1``
        The merged dictionary.
    """
    result = dict1.copy()
    dicts = (dict2,) + dicts
    for d in dicts:
        for kw in d:
            if kw in result:
                # Here is the difference!
                if func(d[kw], result[kw]):
                    result[kw] = d[kw]
            else:
                result[kw] = d[kw]
    return result
So one achieves the same result; with some (stripped-down) example dictionaries:
a = {'a': 1, 'b': 1}
b = {'a': 1, 'b': 2}
and all three function calls do the same:
merge_keep_lowest(a, b)
merge_keep_value_by_method(a, b, method='__lt__')
merge_keep_value_by_func(a, b, func=lambda x, y: True if x < y else False)
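All three calls return {'a': 1, 'b': 1}: for each key, the lower value wins.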
The difference, however, is the time it takes to evaluate the functions (with some bigger input dicts and more of them):
merge_keep_lowest(*lotsofdicts)
1000 loops, best of 3: 483 μs per loop
merge_keep_value_by_method(*lotsofdicts, method='__lt__')
1000 loops, best of 3: 1.29 ms per loop
merge_keep_value_by_func(*lotsofdicts, func=lambda x, y: True if x < y else False)
1000 loops, best of 3: 940 μs per loop
So if you have any comments or recommendations on the code, let me know, but my primary question is: should I just wrap the more generalized function in merge_keep_lowest? Like this:

def merge_keep_lowest(dict1, dict2, *dicts):
    return merge_keep_value_by_func(*((dict1, dict2) + dicts),
                                    func=lambda x, y: True if x < y else False)

and not care that it is slower? Or does it make sense to let very similar code exist in parallel and just keep all of them as they are? Since I sometimes use really big JSON or big normal dicts, speed does make a difference at times, but almost every operation takes less than one second, even with the wrapper.
2 Answers
The version taking func is clearly the nicest; the version which always chooses the lowest is only really better if speed is actually needed.
There's no reason to force at least two parameters. Just use *dicts and default to {}. It's simpler and nicer.
*_by_func is just *_by. *_by_method is horrible and should be avoided - it's not even more general, and it's all stringy.
Note that lambda x, y: True if x < y else False is just lambda x, y: x < y, which is just operator.lt.
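As a quick check of that equivalence, using the example dicts a and b from the question:

from operator import lt

merge_keep_value_by_func(a, b, func=lt)  # {'a': 1, 'b': 1}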
Further, you should really have a fold function, not a comparator, so you can do stuff like

def merge_keep_lowest(*dicts):
    return merge_dicts_by(*dicts, fold=min)

but then also

def merge_counts(*dicts):
    return merge_dicts_by(*dicts, fold=operator.add)

A fold would look like

    result[kw] = fold(result[kw], d[kw])
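A minimal sketch of such a merge_dicts_by, assuming the fold call above and the *dicts-with-{}-default suggestion from earlier (one possible reading, not the only one):

def merge_dicts_by(*dicts, fold):
    # No minimum arity: zero dicts simply produces an empty dict.
    result = dicts[0].copy() if dicts else {}
    for d in dicts[1:]:
        for kw in d:
            # Combine the stored value with the new one, or initialize it.
            result[kw] = fold(result[kw], d[kw]) if kw in result else d[kw]
    return result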
Any comparator comp(new, old) can be turned into a fold with

    lambda old, new: new if comp(new, old) else old

For example, for a comparator of

    lambda new, old: new <= old

one has

    lambda old, new: new if new <= old else old

or, simply stated, min.
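To make that concrete, here is a hypothetical comparator_to_fold helper (the name is made up here) that wraps any comp(new, old) predicate for use with the merge_dicts_by sketch above:

from operator import lt

def comparator_to_fold(comp):
    # comp(new, old) -> bool becomes fold(old, new) -> the value to keep.
    return lambda old, new: new if comp(new, old) else old

# Behaves like merge_keep_lowest:
merge_dicts_by(a, b, fold=comparator_to_fold(lt))  # {'a': 1, 'b': 1}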
- Thank you for your feedback! Yes, I totally forgot about the operator functions. But these fold functions are not what I wanted, because I created the func-version not to replace the other functions, but to compare, say, the absolute value of the numbers, or the first element, or the timestamp if the value was for example a measurement. And having such a fold as an additional argument just makes the branching in the function more complicated again. Besides that, I haven't had any reason to use those - if it gets more complicated with the data, I'll normally switch to some table engine like pandas. - MSeifert, Mar 9, 2016 at 21:48
- @MSeifert Given a comparator's implementation COMP, you just use y if COMP else x. It's hardly an obtuse amount of complexity, and the existence of, say, min makes it more than acceptable. - Veedrac, Mar 9, 2016 at 22:01
- Do you mean something like result[key] = fold(d[key], result[key])? Could you expand your answer to make it clearer how you mean the implementation of your merge_dicts_by with fold? I'm very curious, but I somehow don't get what you mean. - MSeifert, Mar 9, 2016 at 23:12
- @MSeifert I've clarified a bit. Hope that helps. - Veedrac, Mar 10, 2016 at 23:01
Benchmarking
I rigged up the following benchmark:
from operator import lt
from random import randint
from timeit import Timer

def all_eq(expected, *results):
    return all(r == expected for r in results)

d1 = {randint(1, 500): randint(1, 1000) for _ in range(1000)}
d2 = {randint(1, 500): randint(1, 1000) for _ in range(1000)}

SETUP = ('from __main__ import d1, d2, merge_keep_lowest, '
         'merge_keep_value_by_method, merge_keep_value_by_func; '
         'from operator import lt')

TESTS = [
    "merge_keep_lowest(d1, d2)",
    "merge_keep_value_by_method(d1, d2, method='__lt__')",
    "merge_keep_value_by_func(d1, d2, func=lambda x, y: True if x < y else False)",
]

assert all_eq(*map(eval, TESTS))

for test in TESTS:
    print("{1:>3.05f} {0}".format(test, Timer(test, SETUP).timeit(10000)))
The results for your code are consistent with your rankings, even if the execution times aren't proportional.
0.83180 merge_keep_lowest(d1, d2)
1.49818 merge_keep_value_by_method(d1, d2, method='__lt__')
1.31030 merge_keep_value_by_func(d1, d2, func=lambda x, y: True if x < y else False)
Changing the comparator function
Obviously, merge_keep_value_by_func() is both more versatile and faster than merge_keep_value_by_method(). Can we do better?
It turns out that eliminating some silliness in the lambda improves the performance. Using operator.lt instead of a lambda improves performance further.
TESTS = [
    "merge_keep_value_by_func(d1, d2, func=lambda x, y: x < y)",
    "merge_keep_value_by_func(d1, d2, func=lt)",
]
Results:
1.24377 merge_keep_value_by_func(d1, d2, func=lambda x, y: x < y)
1.02642 merge_keep_value_by_func(d1, d2, func=lt)
Now, that's only a ~20% penalty relative to merge_keep_lowest().
Back to basics
At its core, what you are doing is basically dict.copy() and dict.update(). What if we stripped it down to the bare minimum? (Unfortunately, dict.update() doesn't return anything, so it can't be a one-liner.)
def merge_dicts_simple(dict1, dict2):
    result = dict1.copy()
    result.update(dict2)
    return result
Then, we could write this generator expression, and get the same performance as your more complicated and rigid merge_keep_lowest():
0.87297 merge_dicts_simple(d1, ((k, v2) for k, v2 in d2.items() if k not in d1 or v2 < d1[k]))
To be fair, this isn't nearly the same function. It only accepts two dicts, and the technique doesn't generalize well to support more dicts.
Still, it's useful to observe that you can get that kind of performance while maintaining versatility, if you're willing to put more of the responsibility on the caller.
Restoring functionality
Building on that idea, can we restore the original functionality? Here's a more compact way to write your merge_keep_value_by_func().
def merge_dicts_callback(dict1, *dicts, include=lambda a, b: True):
    result = dict1.copy()
    for d in dicts:
        result.update((k, v2) for k, v2 in d.items()
                      if k not in dict1 or include(v2, dict1[k]))
    return result
It's marginally slower than merge_keep_value_by_func(), but it's still pretty good for its simplicity.
1.07106 merge_dicts_callback(d1, d2, include=lt)
- I need to investigate your suggestions in a bit more detail. I often work with OrderedDict, so the update must keep the order of keys. Also, the checks must be if k not in result or include(v2, result[k]), and then your function is a bit slower than mine - a factor of 20-30% (even though I appreciate that it is much shorter!) - MSeifert, Mar 9, 2016 at 22:50
- I don't think there is any sane way to iterate through an OrderedDict out of order, so I wouldn't worry about that. - 200_success, Mar 9, 2016 at 22:52
- Yes, I just saw you edited your update call to a generator instead of the dict comprehension, so that might not be a problem anymore. - MSeifert, Mar 9, 2016 at 22:53