I recently worked with some pickle files in a rather strange format: some keys are colon-delimited paths that sit at the same depth as, and are mixed with, ordinary key-value pairs.
This is a basic example of the mixed format:
{
  "objects": {
    "list::1": {
      "attr1": "foo",
      "attr2": "bar"
    },
    "list::2": {
      "attr1": "foo",
      "attr2": "bar"
    },
    "list::3": {
      "attr1": "foo",
      "attr2": "bar",
      "nested::data::inner": {
        "test": 1
      }
    }
  },
  "dates": {
    "2018::11::01": true,
    "2018::11::02": false,
    "2018::10::02": false
  }
}
The challenge was to convert this to a .mat file so it can be analyzed in MATLAB. However, because the delimiter-separated field names are sometimes longer than 63 characters (MATLAB's limit for struct field names), the conversion failed. This is my implementation for normalizing these mixed-format nested dictionaries.
Maybe there is an inverse of json_normalize() from the pandas library that could do this?
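As far as I know, pandas has no direct inverse of json_normalize(); the usual approach is a small hand-rolled helper. A minimal sketch of the idea (the name unflatten is mine; it splits each key once into path segments and does not recurse into nested dict values the way the normalize function below does):

```python
def unflatten(flat, delim="::"):
    """Rebuild a nested dict from delimiter-separated keys."""
    result = {}
    for path, value in flat.items():
        # everything before the last separator is the parent path
        *parents, leaf = path.split(delim)
        node = result
        for part in parents:
            # create intermediate dicts on demand
            node = node.setdefault(part, {})
        node[leaf] = value
    return result

flat = {"a::b::c": 1, "a::b::d": 2, "e": 3}
print(unflatten(flat))
# {'a': {'b': {'c': 1, 'd': 2}}, 'e': 3}
```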
import json

mydict = {
    "a": {
        "test::test2::test3::test4": {
            "name": "ok",
            "age": 1
        },
        "test::test2::test4::test4": {
            "name": "ok1",
            "age": 2
        },
        "test::test2::test4::test5": {
            "name": "ok1",
            "body::head::foot": {
                "age": 2,
                "thing": "test"
            }
        }
    },
    "b": {
        "name": "ok2",
        "age": "test2"
    }
}
def set_nested(data, args, new_val):
    if args and data:
        element = args[0]
        if element:
            value = data.get(element)
            if len(args) == 1:
                data[element] = new_val
            else:
                set_nested(value, args[1:], new_val)
from collections.abc import MutableMapping  # importing from plain `collections` was removed in Python 3.10
# see https://stackoverflow.com/questions/7204805/dictionaries-of-dictionaries-merge/24088493#24088493
def rec_merge(d1, d2):
    '''
    Update two dicts of dicts recursively;
    if either mapping has leaves that are non-dicts,
    the second's leaf overwrites the first's.
    '''
    for k, v in d1.items():
        if k in d2:
            # this next check is the only difference!
            if all(isinstance(e, MutableMapping) for e in (v, d2[k])):
                d2[k] = rec_merge(v, d2[k])
            # we could further check types and merge as appropriate here.
    d3 = d1.copy()
    d3.update(d2)
    return d3
def normalize(old_dict, delim="::"):
    new_dict = {}
    for key in old_dict.keys():
        splitted = key.split(delim)
        is_split = len(splitted) > 1
        if is_split:
            new_key = splitted[0]
            x = new_dict.get(new_key)
            if not x or not isinstance(x, dict):
                x = {}
            y = {}
            for s in reversed(splitted[1:]):
                y = {s: y}
            set_nested(y, splitted[1:], old_dict[key])
            new_val = rec_merge(x, y)
            new_dict[new_key] = new_val
            if isinstance(old_dict[key], dict):
                new_dict[new_key] = normalize(new_dict[new_key])
        else:
            if isinstance(old_dict[key], dict):
                new_dict[key] = normalize(old_dict[key])
            else:
                new_dict[key] = old_dict[key]
    return new_dict

print(json.dumps(normalize(mydict), indent=2))
This is the expected output after normalizing:
{
  "a": {
    "test": {
      "test2": {
        "test3": {
          "test4": {
            "name": "ok",
            "age": 1
          }
        },
        "test4": {
          "test4": {
            "name": "ok1",
            "age": 2
          },
          "test5": {
            "name": "ok1",
            "body": {
              "head": {
                "foot": {
                  "age": 2,
                  "thing": "test"
                }
              }
            }
          }
        }
      }
    }
  },
  "b": {
    "name": "ok2",
    "age": "test2"
  }
}
Is there a way to simplify the logic in the normalize function?
Answer
OK, so you have arbitrarily deep paths and need to get a nested dictionary out of them. For this you can use an infinitely nestable defaultdict, as shown in this answer by @sth:
from collections import defaultdict

class InfiniteDict(defaultdict):
    def __init__(self):
        defaultdict.__init__(self, self.__class__)
This allows you to write things like d["a"]["b"]["c"] = 3, and it will automatically create all the nested layers. It also lets you parse the dictionary recursively: the outer dictionary can be handled the same way as the inner dictionaries, because *a, b = "foo".split("::") gives a, b = [], "foo".
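A quick, standalone demonstration of that extended-unpacking behaviour (not part of the answer's code):

```python
# With no separator in the string, the starred target absorbs nothing
# and the single segment becomes the key.
*path, key = "foo".split("::")
print(path, key)
# [] foo

# With separators, everything before the last segment goes into `path`.
*path, key = "a::b::c".split("::")
print(path, key)
# ['a', 'b'] c
```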
def parse(d):
    # reached a leaf
    if not isinstance(d, dict):
        return d
    out = InfiniteDict()
    for path, values in d.items():
        # parse the path, if possible
        try:
            *path, key = path.split("::")
        except AttributeError:
            # do nothing if path is not a string
            path, key = [], path
        # follow the path down almost to the end
        # noop if path = []
        temp = out
        for x in path:
            temp = temp[x]
        # assign it to the last part of the path
        # need to parse that as well, in case it has another path
        # works only `sys.getrecursionlimit()` levels deep, obviously
        temp[key] = parse(values)
    return out
For the given example this produces:
InfiniteDict(__main__.InfiniteDict,
    {'a': InfiniteDict(__main__.InfiniteDict,
        {'test': InfiniteDict(__main__.InfiniteDict,
            {'test2': InfiniteDict(__main__.InfiniteDict,
                {'test3': InfiniteDict(__main__.InfiniteDict,
                    {'test4': InfiniteDict(__main__.InfiniteDict,
                        {'age': 1,
                         'name': 'ok'})}),
                 'test4': InfiniteDict(__main__.InfiniteDict,
                    {'test4': InfiniteDict(__main__.InfiniteDict,
                        {'age': 2,
                         'name': 'ok1'}),
                     'test5': InfiniteDict(__main__.InfiniteDict,
                        {'body': InfiniteDict(__main__.InfiniteDict,
                            {'head': InfiniteDict(__main__.InfiniteDict,
                                {'foot': InfiniteDict(__main__.InfiniteDict,
                                    {'age': 2,
                                     'thing': 'test'})})}),
                         'name': 'ok1'})})})})}),
     'b': InfiniteDict(__main__.InfiniteDict,
        {'age': 'test2', 'name': 'ok2'})})
Which looks worse than it is, because InfiniteDict inherits from dict in the end:

isinstance(InfiniteDict(), dict)
# True
InfiniteDict.mro()
# [__main__.InfiniteDict, collections.defaultdict, dict, object]
And so you can json.dumps it, just like you did in your code:

import json

...

if __name__ == "__main__":
    mydict = {...}
    print(json.dumps(parse(mydict), indent=2))
# {
#   "a": {
#     "test": {
#       "test2": {
#         "test3": {
#           "test4": {
#             "name": "ok",
#             "age": 1
#           }
#         },
#         "test4": {
#           "test4": {
#             "name": "ok1",
#             "age": 2
#           },
#           "test5": {
#             "name": "ok1",
#             "body": {
#               "head": {
#                 "foot": {
#                   "age": 2,
#                   "thing": "test"
#                 }
#               }
#             }
#           }
#         }
#       }
#     }
#   },
#   "b": {
#     "name": "ok2",
#     "age": "test2"
#   }
# }
The advantage of this is that the InfiniteDict deals with most of the nasty recursive stuff, and the only thing left to do is make paths out of the strings, if necessary.
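One possible follow-up, given that the original goal was a .mat file: collapse the nested defaultdicts back into plain dicts before handing them to scipy.io.savemat, which expects ordinary mappings. This helper is a sketch (the name to_plain is mine, not part of the answer); note also that MATLAB struct field names must start with a letter, so date keys such as "2018" may still need renaming.

```python
def to_plain(d):
    # recursively copy any mapping into a plain dict; return leaves untouched
    if isinstance(d, dict):
        return {k: to_plain(v) for k, v in d.items()}
    return d
```

After that, something like scipy.io.savemat("out.mat", to_plain(parse(mydict))) should write the result, assuming all keys are valid MATLAB identifiers.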
-
Linus: Great solution, it works perfectly for converting the pickle file to a .mat file. I don't understand the decision to use this strange format, since it can't easily be imported into MATLAB or similar, and the person responsible simply said "you have to do the conversion yourself"... (Nov 26, 2019)
-
Graipher: @Linus Well, it is a slightly compressed version compared to JSON. But the gains are not much, especially if you have to parse it manually... (Nov 26, 2019)
-
Linus: Yeah, the pickle files are about 140 MB, so the compression might be helpful in some way, but it's not an easy format to work with. Anyway, thank you for the time and the code review :) (Nov 26, 2019)
-
Is "a::b": {"c": 2}, "a::b::c": {"d": 3} a possibility, and how would you deal with it?