For some post-processing, I need to flatten a structure like this
{'foo': { 'cat': {'name': 'Hodor', 'age': 7}, 'dog': {'name': 'Mordor', 'age': 5}}, 'bar': { 'rat': {'name': 'Izidor', 'age': 3}} }
Each bottom entries will appear as a row on the output. The heading keys will appear each row, flattened. Perhaps an example is better than my mediocre explanation:
[{'age': 5, 'animal': 'dog', 'foobar': 'foo', 'name': 'Mordor'}, {'age': 7, 'animal': 'cat', 'foobar': 'foo', 'name': 'Hodor'}, {'age': 3, 'animal': 'rat', 'foobar': 'bar', 'name': 'Izidor'}]
I first wrote this function:
def flatten(data, primary_keys):
out = []
keys = copy.copy(primary_keys)
keys.reverse()
def visit(node, primary_values, prim):
if len(prim):
p = prim.pop()
for key, child in node.iteritems():
primary_values[p] = key
visit(child, primary_values, copy.copy(prim))
else:
new = copy.copy(node)
new.update(primary_values)
out.append(new)
visit(data, { }, keys)
return out
out = flatten(a, ['foobar', 'animal'])
I was not really satisfied because I have to use copy.copy
to protect my input arguments. Obviously, when using flatten
one does not want its input data
to be altered.
So I thought about one alternative that uses more global variables (at least global to flatten
) and uses an index instead of directly passing primary_keys
to visit
. However, this does not really help me to get rid of the ugly initial copy:
keys = copy.copy(primary_keys)
keys.reverse()
So here is my final version:
def flatten(data, keys):
data = copy.copy(data)
keys = copy.copy(keys)
keys.reverse()
out = []
values = {}
def visit(node, id):
if id:
id -= 1
for key, child in node.iteritems():
values[keys[id]] = key
visit(child, id)
else:
node.update(values)
out.append(node)
visit(data, len(keys))
return out
I am sure some Python magic will help in this case.
1 Answer 1
Both algorithms recurse using the length of keys
to stop, so I am going to assume that the nested dictionaries always have the same level of nesting too. If your input can be of the form:
{'foo': {
'cat': {'name': 'Hodor', 'age': 7},
'dog': {'name': 'Mordor', 'age': 5}},
'bar': { 'rat': {'name': 'Izidor', 'age': 3}},
'baz': 'woops',
}
then your approach can't handle it and neither will mine.
I quickly stopped trying to understand how your algorithm work and started to think about how I would implement it myself. This indicates that:
- your algorithm is not that trivial;
- it is poorly documented.
You should at least have comments indicating why you use some approaches: reversing the keys
and iterating over them in decreasing order, storing your group names/values (values[keys[id]] = key
) as you go into nesting levels and updating the last dictionary when you reach it...
Speaking about updating the last dictionary, note that data = copy.copy(data)
does not protect your node.update(values)
to modify the original data
in place. You either need to use copy.deepcopy
or to change the updated dictionary (create a new one and update it with both node
and values
).
Now let me show you an other approach. Rather than wrapping a function that access global variables (this is what visit
look like) into flatten
, you can make flatten
the recursive function by splitting keys
into its head
and tail
part. When there is no element left, you won't be able to do it and you can stop the recursion by returning the data
you're on: this is one of the most nested dictionaries.
Otherwise, you can iterate over the key/values pairs, flatten the values using the tail
as a new set of keys
and then, build a list out of the flattened values and the {head: key}
dictionary.
To make things a bit more efficient, I'll use generators instead of building lists, so you’ll want to change your calls from out = flatten(a, ['foobar', 'animal'])
to out = list(flatten(a, ['foobar', 'animal']))
(calls of the form for flattened in flatten(a, ['foobar', 'animal']):
don't need to be changed though):
def flatten(data, group_names):
try:
group, group_names = group_names[0], group_names[1:]
except IndexError:
# No more key to extract, we just reached the most nested dictionnary
yield data.copy() # Build a new dict so we don't modify data in place
return # Nothing more to do, it is already considered flattened
for key, value in data.iteritems():
# value can contain nested dictionaries
# so flatten it and iterate over the result
for flattened in flatten(value, group_names):
flattened.update({group: key})
yield flattened
I also changed keys
to group_names
to be able to use the generic names key
and value
when iterating over data
.
In case the input data can ever contain less nested levels than the amount of items in group_names
, you'll reach a point where data.iteritems()
will raise and AttributeError
. You can catch that if you so choose:
def flatten(data, group_names):
try:
group, group_names = group_names[0], group_names[1:]
except IndexError:
# No more key to extract, we just reached the most nested dictionnary
yield data.copy() # Build a new dict so we don't modify data in place
return # Nothing more to do, it is already considered flattened
try:
for key, value in data.iteritems():
# value can contain nested dictionaries
# so flatten it and iterate over the result
for flattened in flatten(value, group_names):
flattened.update({group: key})
yield flattened
except AttributeError:
yield {group: data}
So
a = {
'foo': {
'cat': {'name': 'Hodor', 'age': 7},
'dog': {'name': 'Mordor', 'age': 5},
},
'bar': {
'rat': {'name': 'Izidor', 'age': 3},
},
'baz': 'woops',
}
list(flatten(a, ['foobar', 'animal']))
will return
[{'animal': 'woops', 'foobar': 'baz'},
{'age': 5, 'animal': 'dog', 'foobar': 'foo', 'name': 'Mordor'},
{'age': 7, 'animal': 'cat', 'foobar': 'foo', 'name': 'Hodor'},
{'age': 3, 'animal': 'rat', 'foobar': 'bar', 'name': 'Izidor'}]
-
\$\begingroup\$ Thanks, this will surely help me in the future. So I will retain these two basic concepts. First, the use of generators and second, your elegant way of shifting a list
element, a = a[0], a[1:]
. That said I am still wondering how thetry
/catch
exception handling will impact the performances. I may want to do some tests about this. \$\endgroup\$nowox– nowox2016年05月25日 20:18:33 +00:00Commented May 25, 2016 at 20:18 -
1\$\begingroup\$ @nowox It is even neater in Python 3:
element, *a = a
; it raisesValueError
though. Exception handling is pretty good in python. I doubt thatif group_names: ... else: yield ...
will be faster. \$\endgroup\$301_Moved_Permanently– 301_Moved_Permanently2016年05月25日 20:26:56 +00:00Commented May 25, 2016 at 20:26
'foobar'
and'animal'
come from? They appear nowhere in your calling example. Shouldn't it beout = flatten(a, ['foobar', 'animal'])
instead? \$\endgroup\$