Flatten a nested dict structure in Python

Question 1

For some post-processing, I need to flatten a structure like this

{'foo': {
 'cat': {'name': 'Hodor', 'age': 7},
 'dog': {'name': 'Mordor', 'age': 5}},
 'bar': { 'rat': {'name': 'Izidor', 'age': 3}}
}

Each bottom entries will appear as a row on the output. The heading keys will appear each row, flattened. Perhaps an example is better than my mediocre explanation:

[{'age': 5, 'animal': 'dog', 'foobar': 'foo', 'name': 'Mordor'},
 {'age': 7, 'animal': 'cat', 'foobar': 'foo', 'name': 'Hodor'},
 {'age': 3, 'animal': 'rat', 'foobar': 'bar', 'name': 'Izidor'}]

I first wrote this function:

def flatten(data, primary_keys):
 out = []
 keys = copy.copy(primary_keys)
 keys.reverse()
 def visit(node, primary_values, prim):
 if len(prim):
 p = prim.pop()
 for key, child in node.iteritems():
 primary_values[p] = key
 visit(child, primary_values, copy.copy(prim))
 else:
 new = copy.copy(node)
 new.update(primary_values)
 out.append(new)
 visit(data, { }, keys)
 return out
out = flatten(a, ['foobar', 'animal'])

I was not really satisfied because I have to use copy.copy to protect my input arguments. Obviously, when using flatten one does not want its input data to be altered.

So I thought about one alternative that uses more global variables (at least global to flatten) and uses an index instead of directly passing primary_keys to visit. However, this does not really help me to get rid of the ugly initial copy:

keys = copy.copy(primary_keys)
keys.reverse()

So here is my final version:

def flatten(data, keys):
 data = copy.copy(data)
 keys = copy.copy(keys)
 keys.reverse()
 out = []
 values = {}
 def visit(node, id):
 if id:
 id -= 1
 for key, child in node.iteritems():
 values[keys[id]] = key
 visit(child, id)
 else:
 node.update(values)
 out.append(node)
 visit(data, len(keys))
 return out

I am sure some Python magic will help in this case.

Question 2

Where does 'foobar' and 'animal' come from? They appear nowhere in your calling example. Shouldn't it be out = flatten(a, ['foobar', 'animal']) instead?

Question 3

Absolutely, my mistake !

Question 4

Both algorithms recurse using the length of keys to stop, so I am going to assume that the nested dictionaries always have the same level of nesting too. If your input can be of the form:

{'foo': {
 'cat': {'name': 'Hodor', 'age': 7},
 'dog': {'name': 'Mordor', 'age': 5}},
 'bar': { 'rat': {'name': 'Izidor', 'age': 3}},
 'baz': 'woops',
}

then your approach can't handle it and neither will mine.

I quickly stopped trying to understand how your algorithm work and started to think about how I would implement it myself. This indicates that:

your algorithm is not that trivial;
it is poorly documented.

You should at least have comments indicating why you use some approaches: reversing the keys and iterating over them in decreasing order, storing your group names/values (values[keys[id]] = key) as you go into nesting levels and updating the last dictionary when you reach it...

Speaking about updating the last dictionary, note that data = copy.copy(data) does not protect your node.update(values) to modify the original data in place. You either need to use copy.deepcopy or to change the updated dictionary (create a new one and update it with both node and values).

Now let me show you an other approach. Rather than wrapping a function that access global variables (this is what visit look like) into flatten, you can make flatten the recursive function by splitting keys into its head and tail part. When there is no element left, you won't be able to do it and you can stop the recursion by returning the data you're on: this is one of the most nested dictionaries.

Otherwise, you can iterate over the key/values pairs, flatten the values using the tail as a new set of keys and then, build a list out of the flattened values and the {head: key} dictionary.

To make things a bit more efficient, I'll use generators instead of building lists, so you’ll want to change your calls from out = flatten(a, ['foobar', 'animal']) to out = list(flatten(a, ['foobar', 'animal'])) (calls of the form for flattened in flatten(a, ['foobar', 'animal']): don't need to be changed though):

def flatten(data, group_names):
 try:
 group, group_names = group_names[0], group_names[1:]
 except IndexError:
 # No more key to extract, we just reached the most nested dictionnary
 yield data.copy() # Build a new dict so we don't modify data in place
 return # Nothing more to do, it is already considered flattened
 for key, value in data.iteritems():
 # value can contain nested dictionaries
 # so flatten it and iterate over the result
 for flattened in flatten(value, group_names):
 flattened.update({group: key})
 yield flattened

I also changed keys to group_names to be able to use the generic names key and value when iterating over data.

In case the input data can ever contain less nested levels than the amount of items in group_names, you'll reach a point where data.iteritems() will raise and AttributeError. You can catch that if you so choose:

def flatten(data, group_names):
 try:
 group, group_names = group_names[0], group_names[1:]
 except IndexError:
 # No more key to extract, we just reached the most nested dictionnary
 yield data.copy() # Build a new dict so we don't modify data in place
 return # Nothing more to do, it is already considered flattened
 try:
 for key, value in data.iteritems():
 # value can contain nested dictionaries
 # so flatten it and iterate over the result
 for flattened in flatten(value, group_names):
 flattened.update({group: key})
 yield flattened
 except AttributeError:
 yield {group: data}

So

a = {
 'foo': {
 'cat': {'name': 'Hodor', 'age': 7},
 'dog': {'name': 'Mordor', 'age': 5},
 },
 'bar': {
 'rat': {'name': 'Izidor', 'age': 3},
 },
 'baz': 'woops',
}
list(flatten(a, ['foobar', 'animal']))

will return

[{'animal': 'woops', 'foobar': 'baz'},
 {'age': 5, 'animal': 'dog', 'foobar': 'foo', 'name': 'Mordor'},
 {'age': 7, 'animal': 'cat', 'foobar': 'foo', 'name': 'Hodor'},
 {'age': 3, 'animal': 'rat', 'foobar': 'bar', 'name': 'Izidor'}]

Question 5

Thanks, this will surely help me in the future. So I will retain these two basic concepts. First, the use of generators and second, your elegant way of shifting a list element, a = a[0], a[1:]. That said I am still wondering how the try/catch exception handling will impact the performances. I may want to do some tests about this.

Question 6

@nowox It is even neater in Python 3: element, *a = a; it raises ValueError though. Exception handling is pretty good in python. I doubt that if group_names: ... else: yield ... will be faster.

score 3 · Accepted Answer · 2016-05-25 12:54:25Z

Both algorithms recurse using the length of keys to stop, so I am going to assume that the nested dictionaries always have the same level of nesting too. If your input can be of the form:

{'foo': {
 'cat': {'name': 'Hodor', 'age': 7},
 'dog': {'name': 'Mordor', 'age': 5}},
 'bar': { 'rat': {'name': 'Izidor', 'age': 3}},
 'baz': 'woops',
}

then your approach can't handle it and neither will mine.

I quickly stopped trying to understand how your algorithm work and started to think about how I would implement it myself. This indicates that:

your algorithm is not that trivial;
it is poorly documented.

You should at least have comments indicating why you use some approaches: reversing the keys and iterating over them in decreasing order, storing your group names/values (values[keys[id]] = key) as you go into nesting levels and updating the last dictionary when you reach it...

Speaking about updating the last dictionary, note that data = copy.copy(data) does not protect your node.update(values) to modify the original data in place. You either need to use copy.deepcopy or to change the updated dictionary (create a new one and update it with both node and values).

Now let me show you an other approach. Rather than wrapping a function that access global variables (this is what visit look like) into flatten, you can make flatten the recursive function by splitting keys into its head and tail part. When there is no element left, you won't be able to do it and you can stop the recursion by returning the data you're on: this is one of the most nested dictionaries.

Otherwise, you can iterate over the key/values pairs, flatten the values using the tail as a new set of keys and then, build a list out of the flattened values and the {head: key} dictionary.

To make things a bit more efficient, I'll use generators instead of building lists, so you’ll want to change your calls from out = flatten(a, ['foobar', 'animal']) to out = list(flatten(a, ['foobar', 'animal'])) (calls of the form for flattened in flatten(a, ['foobar', 'animal']): don't need to be changed though):

def flatten(data, group_names):
 try:
 group, group_names = group_names[0], group_names[1:]
 except IndexError:
 # No more key to extract, we just reached the most nested dictionnary
 yield data.copy() # Build a new dict so we don't modify data in place
 return # Nothing more to do, it is already considered flattened
 for key, value in data.iteritems():
 # value can contain nested dictionaries
 # so flatten it and iterate over the result
 for flattened in flatten(value, group_names):
 flattened.update({group: key})
 yield flattened

I also changed keys to group_names to be able to use the generic names key and value when iterating over data.

In case the input data can ever contain less nested levels than the amount of items in group_names, you'll reach a point where data.iteritems() will raise and AttributeError. You can catch that if you so choose:

def flatten(data, group_names):
 try:
 group, group_names = group_names[0], group_names[1:]
 except IndexError:
 # No more key to extract, we just reached the most nested dictionnary
 yield data.copy() # Build a new dict so we don't modify data in place
 return # Nothing more to do, it is already considered flattened
 try:
 for key, value in data.iteritems():
 # value can contain nested dictionaries
 # so flatten it and iterate over the result
 for flattened in flatten(value, group_names):
 flattened.update({group: key})
 yield flattened
 except AttributeError:
 yield {group: data}

So

a = {
 'foo': {
 'cat': {'name': 'Hodor', 'age': 7},
 'dog': {'name': 'Mordor', 'age': 5},
 },
 'bar': {
 'rat': {'name': 'Izidor', 'age': 3},
 },
 'baz': 'woops',
}
list(flatten(a, ['foobar', 'animal']))

will return

[{'animal': 'woops', 'foobar': 'baz'},
 {'age': 5, 'animal': 'dog', 'foobar': 'foo', 'name': 'Mordor'},
 {'age': 7, 'animal': 'cat', 'foobar': 'foo', 'name': 'Hodor'},
 {'age': 3, 'animal': 'rat', 'foobar': 'bar', 'name': 'Izidor'}]

Thanks, this will surely help me in the future. So I will retain these two basic concepts. First, the use of generators and second, your elegant way of shifting a list element, a = a[0], a[1:]. That said I am still wondering how the try/catch exception handling will impact the performances. I may want to do some tests about this.
@nowox It is even neater in Python 3: element, *a = a; it raises ValueError though. Exception handling is pretty good in python. I doubt that if group_names: ... else: yield ... will be faster.

Stack Exchange Network

Flatten a nested dict structure in Python

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Flatten a nested dict structure in Python

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions