I have several functions for merging dictionaries, but over time I created a more general function that would make all the others obsolete if it weren't slower.
The specialized functions (there are several like this one) look like this:
def merge_keep_lowest(dict1, dict2, *dicts):
    """
    Merge an arbitrary number of :py:class:`dict`-like objects, keeping the
    **lowest** encountered value for each key.

    Parameters
    ----------
    *dicts : dict-like
        The dictionaries to be merged. At least two must be given.

    Returns
    -------
    result : same type as ``dict1``
        The merged dictionary.
    """
    # Copy the first dict so the result's class and its properties are defined.
    result = dict1.copy()
    # We only want to iterate once, so combine the second and the other dicts.
    dicts = (dict2,) + dicts
    # Now iterate over each dictionary, since there is no directly usable
    # dict method for this kind of operation.
    for d in dicts:
        # Iterate over each key of this dict.
        # This is faster than "for kw in d.keys()".
        for kw in d:
            # One could also use "try ... except KeyError ..." here instead of
            # the "if kw in result". That would be a bit faster if all dicts
            # contained mostly the same keys, but containment checks on
            # dictionaries are relatively cheap, so it doesn't make a huge
            # difference.
            if kw in result:
                # The key is already present in the result, so compare the
                # values and replace if the new one is smaller.
                if d[kw] < result[kw]:
                    result[kw] = d[kw]
            else:
                # If the key is not yet in the result dict, just initialize it.
                result[kw] = d[kw]
    return result
There are some more for replacing the value if it is higher/shorter/longer/... They all look exactly the same except for the if d[kw] < result[kw]: line, as the sketch below shows.
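For illustration, one of those siblings might look like this (merge_keep_highest is a hypothetical name, and the docstring is omitted; only the comparison differs):

def merge_keep_highest(dict1, dict2, *dicts):
    result = dict1.copy()
    dicts = (dict2,) + dicts
    for d in dicts:
        for kw in d:
            if kw in result:
                # The only changed line: keep the highest value instead.
                if d[kw] > result[kw]:
                    result[kw] = d[kw]
            else:
                result[kw] = d[kw]
    return result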
Then I thought: since < is also defined as a method on many data types, I could generalize and allow arbitrary method calls of that nature:
def merge_keep_value_by_method(dict1, dict2, *dicts, method=None):
    """
    Merge an arbitrary number of :py:class:`dict`-like objects, replacing the
    temporary value for each key if it satisfies
    ``new_value.method(tmp_value)``.

    Parameters
    ----------
    *dicts : dict-like
        The dictionaries to be merged. At least two must be given.
    method : str
        If a key encountered during the merge is already present in the
        temporary result, its value is replaced by the new value if it
        satisfies the condition ``new_value.method(tmp_value)``.
        Default is ``None``, but it **must** be overridden, otherwise an
        error is raised.

    Returns
    -------
    result : same type as ``dict1``
        The merged dictionary.
    """
    result = dict1.copy()
    dicts = (dict2,) + dicts
    for d in dicts:
        for kw in d:
            if kw in result:
                # Here is the difference!
                if getattr(d[kw], method)(result[kw]):
                    result[kw] = d[kw]
            else:
                result[kw] = d[kw]
    return result
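The method does not have to be a comparison dunder; any method that takes the stored value and returns something truthy will do. A hypothetical example with set values (s1 and s2 are made up for illustration):

s1 = {'x': {1, 2, 3}}
s2 = {'x': {1, 2}}
# Replace the stored set whenever the new one is a subset of it,
# i.e. keep the smallest set encountered per key.
merge_keep_value_by_method(s1, s2, method='issubset')  # {'x': {1, 2}}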
But this could be generalized even more by allowing arbitrary functions to be executed. This has the downside that one must wrap methods in a function if they are only available as methods, but it allows much more freedom:
def merge_keep_value_by_func(dict1, dict2, *dicts, func=None):
    """
    Merge an arbitrary number of :py:class:`dict`-like objects, replacing the
    temporary value for each key if it satisfies
    ``func(new_value, tmp_value)``.

    Parameters
    ----------
    *dicts : dict-like
        The dictionaries to be merged. At least two must be given.
    func : callable
        If a key encountered during the merge is already present in the
        temporary result, its value is replaced by the new value if it
        satisfies the condition ``func(new_value, tmp_value)``.
        Default is ``None``, but it **must** be overridden, otherwise an
        error is raised.

    Returns
    -------
    result : same type as ``dict1``
        The merged dictionary.
    """
    result = dict1.copy()
    dicts = (dict2,) + dicts
    for d in dicts:
        for kw in d:
            if kw in result:
                # Here is the difference!
                if func(d[kw], result[kw]):
                    result[kw] = d[kw]
            else:
                result[kw] = d[kw]
    return result
So one achieves the same result; with some (stripped-down) example dictionaries:
a = {'a': 1, 'b': 1}
b = {'a': 1, 'b': 2}
and all three function calls do the same:
merge_keep_lowest(a, b)
merge_keep_value_by_method(a, b, method='__lt__')
merge_keep_value_by_func(a, b, func=lambda x, y: True if x < y else False)
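All three calls return {'a': 1, 'b': 1}: for each key, the lower value wins.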
The difference, however, is the time it takes to evaluate the functions (with some bigger input dicts and more of them):
merge_keep_lowest(*lotsofdicts)
1000 loops, best of 3: 483 μs per loop
merge_keep_value_by_method(*lotsofdicts, method='__lt__')
1000 loops, best of 3: 1.29 ms per loop
merge_keep_value_by_func(*lotsofdicts, func=lambda x, y: True if x < y else False)
1000 loops, best of 3: 940 μs per loop
So if you have any comments or recommendations on the code, let me know, but my primary question is: should I just wrap the more generalized function in merge_keep_lowest? Like this:

def merge_keep_lowest(dict1, dict2, *dicts):
    return merge_keep_value_by_func(*((dict1, dict2) + dicts),
                                    func=lambda x, y: True if x < y else False)

and not care that it is slower? Or does it make sense to let very similar code exist in parallel and just keep all of them as they are? Since I sometimes use really big JSON or big normal dicts, speed does make a difference at times, but almost every operation takes less than one second, even with the wrapper.
2 Answers
The version taking func is clearly the nicest; the version which always chooses the lowest is only really better if speed is actually needed.
There's no reason to force at least two parameters. Just use *dicts and default to {}. It's simpler and nicer.
*_by_func is just *_by. *_by_method is horrible and should be avoided - it's not even more general, and it's all stringy.
Note that lambda x, y: True if x < y else False is just lambda x, y: x < y, which is just operator.lt.
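As a quick check of that equivalence, using the example dicts a and b from the question:

from operator import lt

merge_keep_value_by_func(a, b, func=lt)  # {'a': 1, 'b': 1}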
Further, you should really have a fold function, not a comparator, so you can do stuff like

def merge_keep_lowest(*dicts):
    return merge_dicts_by(*dicts, fold=min)

but then also

def merge_counts(*dicts):
    return merge_dicts_by(*dicts, fold=operator.add)

A fold would look like

    result[kw] = fold(result[kw], d[kw])
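A minimal sketch of such a merge_dicts_by, assuming the fold call above and the *dicts-with-{}-default suggestion from earlier (one possible reading, not the only one):

def merge_dicts_by(*dicts, fold):
    # No minimum arity: zero dicts simply produces an empty dict.
    result = dicts[0].copy() if dicts else {}
    for d in dicts[1:]:
        for kw in d:
            # Combine the stored value with the new one, or initialize it.
            result[kw] = fold(result[kw], d[kw]) if kw in result else d[kw]
    return result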
Any comparator comp(new, old) can be turned into a fold with

    lambda old, new: new if comp(new, old) else old

For example, for a comparator of

    lambda new, old: new <= old

one has

    lambda old, new: new if new <= old else old

or, simply stated, min.
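To make that concrete, here is a hypothetical comparator_to_fold helper (the name is made up here) that wraps any comp(new, old) predicate for use with the merge_dicts_by sketch above:

from operator import lt

def comparator_to_fold(comp):
    # comp(new, old) -> bool becomes fold(old, new) -> the value to keep.
    return lambda old, new: new if comp(new, old) else old

# Behaves like merge_keep_lowest:
merge_dicts_by(a, b, fold=comparator_to_fold(lt))  # {'a': 1, 'b': 1}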
- Thank you for your feedback! Yes, I totally forgot about the operator functions. But these fold functions are not what I wanted, because I created the func-version not to replace the other functions, but to compare, say, the absolute value of the numbers, or the first element, or the timestamp if the value was for example a measurement. And having such a fold as an additional argument just makes the branching in the function more complicated again. Besides that, I haven't had any reason to use those - if it gets more complicated with the data, I'll normally switch to some table engine like pandas. - MSeifert, Mar 9, 2016 at 21:48
- @MSeifert Given a comparator's implementation COMP, you just use y if COMP else x. It's hardly an obtuse amount of complexity, and the existence of, say, min makes it more than acceptable. - Veedrac, Mar 9, 2016 at 22:01
- Do you mean something like result[key] = fold(d[key], result[key])? Could you expand your answer to make it clearer how you mean the implementation of your merge_dicts_by with fold? I'm very curious, but I somehow don't get what you mean. - MSeifert, Mar 9, 2016 at 23:12
- @MSeifert I've clarified a bit. Hope that helps. - Veedrac, Mar 10, 2016 at 23:01
Benchmarking
I rigged up the following benchmark:
from operator import lt
from random import randint
from timeit import Timer

def all_eq(expected, *results):
    return all(r == expected for r in results)

d1 = {randint(1, 500): randint(1, 1000) for _ in range(1000)}
d2 = {randint(1, 500): randint(1, 1000) for _ in range(1000)}

SETUP = ('from __main__ import d1, d2, merge_keep_lowest, '
         'merge_keep_value_by_method, merge_keep_value_by_func; '
         'from operator import lt')

TESTS = [
    "merge_keep_lowest(d1, d2)",
    "merge_keep_value_by_method(d1, d2, method='__lt__')",
    "merge_keep_value_by_func(d1, d2, func=lambda x, y: True if x < y else False)",
]

assert all_eq(*map(eval, TESTS))

for test in TESTS:
    print("{1:>3.05f} {0}".format(test, Timer(test, SETUP).timeit(10000)))
The results for your code are consistent with your rankings, even if the execution times aren't proportional.
0.83180 merge_keep_lowest(d1, d2)
1.49818 merge_keep_value_by_method(d1, d2, method='__lt__')
1.31030 merge_keep_value_by_func(d1, d2, func=lambda x, y: True if x < y else False)
Changing the comparator function
Obviously, merge_keep_value_by_func() is both more versatile and faster than merge_keep_value_by_method(). Can we do better?
It turns out that eliminating some silliness in the lambda improves the performance. Using operator.lt instead of a lambda improves performance further.
TESTS = [
    "merge_keep_value_by_func(d1, d2, func=lambda x, y: x < y)",
    "merge_keep_value_by_func(d1, d2, func=lt)",
]
Results:
1.24377 merge_keep_value_by_func(d1, d2, func=lambda x, y: x < y)
1.02642 merge_keep_value_by_func(d1, d2, func=lt)
Now, that's only a ~20% penalty relative to merge_keep_lowest().
Back to basics
At its core, what you are doing is basically dict.copy() and dict.update(). What if we stripped it down to the bare minimum? (Unfortunately, dict.update() doesn't return anything, so it can't be a one-liner.)
def merge_dicts_simple(dict1, dict2):
    result = dict1.copy()
    result.update(dict2)
    return result
Then, we could write this generator expression, and get the same performance as your more complicated and rigid merge_keep_lowest():
0.87297 merge_dicts_simple(d1, ((k, v2) for k, v2 in d2.items() if k not in d1 or v2 < d1[k]))
To be fair, this isn't nearly the same function. It only accepts two dicts, and the technique doesn't generalize well to support more dicts.
Still, it's useful to observe that you can get that kind of performance while maintaining versatility, if you're willing to put more of the responsibility on the caller.
Restoring functionality
Building on that idea, can we restore the original functionality? Here's a more compact way to write your merge_keep_value_by_func().
def merge_dicts_callback(dict1, *dicts, include=lambda a, b: True):
    result = dict1.copy()
    for d in dicts:
        result.update((k, v2) for k, v2 in d.items()
                      if k not in dict1 or include(v2, dict1[k]))
    return result
It's marginally slower than merge_keep_value_by_func(), but it's still pretty good for its simplicity.
1.07106 merge_dicts_callback(d1, d2, include=lt)
- I need to investigate your suggestions in a bit more detail. I often work with OrderedDict, so the update must keep the order of keys. Also, the checks must be if k not in result or include(v2, result[k]), and then your function is a bit slower than mine - a factor of 20-30% (even though I appreciate that it is much shorter!) - MSeifert, Mar 9, 2016 at 22:50
- I don't think there is any sane way to iterate through an OrderedDict out of order, so I wouldn't worry about that. - 200_success, Mar 9, 2016 at 22:52
- Yes, I just saw you edited your update call to a generator instead of the dict comprehension, so that might not be a problem anymore. - MSeifert, Mar 9, 2016 at 22:53