I recently worked with some pickle files in a rather strange format: some keys are colon-delimited paths that sit at the same depth as, and are mixed with, ordinary key-value pairs.
This is a basic example of the mixed format:
{
  "objects": {
    "list::1": {
      "attr1": "foo",
      "attr2": "bar"
    },
    "list::2": {
      "attr1": "foo",
      "attr2": "bar"
    },
    "list::3": {
      "attr1": "foo",
      "attr2": "bar",
      "nested::data::inner": {
        "test": 1
      }
    }
  },
  "dates": {
    "2018::11::01": true,
    "2018::11::02": false,
    "2018::10::02": false
  }
}
The challenge was to convert this to a .mat file so it can be analyzed in MATLAB. However, because the delimiter-separated field names are sometimes longer than 63 characters (MATLAB's limit for struct field names), the conversion failed. This is my implementation for normalizing these mixed-format nested dictionaries.
Maybe there is an inverse of json_normalize() from the pandas library that could do this?
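As far as I know, pandas has no direct inverse of json_normalize(); the usual approach is a small hand-rolled helper. A minimal sketch of the idea (the name unflatten is mine; it splits each key once into path segments and does not recurse into nested dict values the way the normalize function below does):

```python
def unflatten(flat, delim="::"):
    """Rebuild a nested dict from delimiter-separated keys."""
    result = {}
    for path, value in flat.items():
        # everything before the last separator is the parent path
        *parents, leaf = path.split(delim)
        node = result
        for part in parents:
            # create intermediate dicts on demand
            node = node.setdefault(part, {})
        node[leaf] = value
    return result

flat = {"a::b::c": 1, "a::b::d": 2, "e": 3}
print(unflatten(flat))
# {'a': {'b': {'c': 1, 'd': 2}}, 'e': 3}
```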
import json

mydict = {
    "a": {
        "test::test2::test3::test4": {
            "name": "ok",
            "age": 1
        },
        "test::test2::test4::test4": {
            "name": "ok1",
            "age": 2
        },
        "test::test2::test4::test5": {
            "name": "ok1",
            "body::head::foot": {
                "age": 2,
                "thing": "test"
            }
        }
    },
    "b": {
        "name": "ok2",
        "age": "test2"
    }
}
def set_nested(data, args, new_val):
    if args and data:
        element = args[0]
        if element:
            value = data.get(element)
            if len(args) == 1:
                data[element] = new_val
            else:
                set_nested(value, args[1:], new_val)
from collections.abc import MutableMapping  # importing from plain `collections` was removed in Python 3.10
# see https://stackoverflow.com/questions/7204805/dictionaries-of-dictionaries-merge/24088493#24088493
def rec_merge(d1, d2):
    '''
    Update two dicts of dicts recursively;
    if either mapping has leaves that are non-dicts,
    the second's leaf overwrites the first's.
    '''
    for k, v in d1.items():
        if k in d2:
            # this next check is the only difference!
            if all(isinstance(e, MutableMapping) for e in (v, d2[k])):
                d2[k] = rec_merge(v, d2[k])
            # we could further check types and merge as appropriate here.
    d3 = d1.copy()
    d3.update(d2)
    return d3
def normalize(old_dict, delim="::"):
    new_dict = {}
    for key in old_dict.keys():
        splitted = key.split(delim)
        is_split = len(splitted) > 1
        if is_split:
            new_key = splitted[0]
            x = new_dict.get(new_key)
            if not x or not isinstance(x, dict):
                x = {}
            y = {}
            for s in reversed(splitted[1:]):
                y = {s: y}
            set_nested(y, splitted[1:], old_dict[key])
            new_val = rec_merge(x, y)
            new_dict[new_key] = new_val
            if isinstance(old_dict[key], dict):
                new_dict[new_key] = normalize(new_dict[new_key])
        else:
            if isinstance(old_dict[key], dict):
                new_dict[key] = normalize(old_dict[key])
            else:
                new_dict[key] = old_dict[key]
    return new_dict

print(json.dumps(normalize(mydict), indent=2))
This is the expected output after normalizing:
{
  "a": {
    "test": {
      "test2": {
        "test3": {
          "test4": {
            "name": "ok",
            "age": 1
          }
        },
        "test4": {
          "test4": {
            "name": "ok1",
            "age": 2
          },
          "test5": {
            "name": "ok1",
            "body": {
              "head": {
                "foot": {
                  "age": 2,
                  "thing": "test"
                }
              }
            }
          }
        }
      }
    }
  },
  "b": {
    "name": "ok2",
    "age": "test2"
  }
}
Is there a way to simplify the logic in the normalize function?
Answer
OK, so you have arbitrarily deep paths and need to get a nested dictionary out of them. For this you can use an infinitely nestable defaultdict, as shown in this answer by @sth:
from collections import defaultdict

class InfiniteDict(defaultdict):
    def __init__(self):
        defaultdict.__init__(self, self.__class__)
This allows you to write things like d["a"]["b"]["c"] = 3, and it will automatically create all the nested layers. It also lets you parse the dictionary recursively: the outer dictionary can be handled the same way as the inner dictionaries, because *a, b = "foo".split("::") gives a, b = [], "foo".
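A quick, standalone demonstration of that extended-unpacking behaviour (not part of the answer's code):

```python
# With no separator in the string, the starred target absorbs nothing
# and the single segment becomes the key.
*path, key = "foo".split("::")
print(path, key)
# [] foo

# With separators, everything before the last segment goes into `path`.
*path, key = "a::b::c".split("::")
print(path, key)
# ['a', 'b'] c
```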
def parse(d):
    # reached a leaf
    if not isinstance(d, dict):
        return d
    out = InfiniteDict()
    for path, values in d.items():
        # parse the path, if possible
        try:
            *path, key = path.split("::")
        except AttributeError:
            # do nothing if path is not a string
            path, key = [], path
        # follow the path down almost to the end
        # noop if path = []
        temp = out
        for x in path:
            temp = temp[x]
        # assign it to the last part of the path
        # need to parse that as well, in case it has another path
        # works only `sys.getrecursionlimit()` levels deep, obviously
        temp[key] = parse(values)
    return out
For the given example this produces:
InfiniteDict(__main__.InfiniteDict,
    {'a': InfiniteDict(__main__.InfiniteDict,
        {'test': InfiniteDict(__main__.InfiniteDict,
            {'test2': InfiniteDict(__main__.InfiniteDict,
                {'test3': InfiniteDict(__main__.InfiniteDict,
                    {'test4': InfiniteDict(__main__.InfiniteDict,
                        {'age': 1,
                         'name': 'ok'})}),
                 'test4': InfiniteDict(__main__.InfiniteDict,
                    {'test4': InfiniteDict(__main__.InfiniteDict,
                        {'age': 2,
                         'name': 'ok1'}),
                     'test5': InfiniteDict(__main__.InfiniteDict,
                        {'body': InfiniteDict(__main__.InfiniteDict,
                            {'head': InfiniteDict(__main__.InfiniteDict,
                                {'foot': InfiniteDict(__main__.InfiniteDict,
                                    {'age': 2,
                                     'thing': 'test'})})}),
                         'name': 'ok1'})})})})}),
     'b': InfiniteDict(__main__.InfiniteDict,
        {'age': 'test2', 'name': 'ok2'})})
Which looks worse than it is, because InfiniteDict inherits from dict in the end:

isinstance(InfiniteDict(), dict)
# True
InfiniteDict.mro()
# [__main__.InfiniteDict, collections.defaultdict, dict, object]
And so you can json.dumps it, just like you did in your code:

import json

...

if __name__ == "__main__":
    mydict = {...}
    print(json.dumps(parse(mydict), indent=2))
# {
#   "a": {
#     "test": {
#       "test2": {
#         "test3": {
#           "test4": {
#             "name": "ok",
#             "age": 1
#           }
#         },
#         "test4": {
#           "test4": {
#             "name": "ok1",
#             "age": 2
#           },
#           "test5": {
#             "name": "ok1",
#             "body": {
#               "head": {
#                 "foot": {
#                   "age": 2,
#                   "thing": "test"
#                 }
#               }
#             }
#           }
#         }
#       }
#     }
#   },
#   "b": {
#     "name": "ok2",
#     "age": "test2"
#   }
# }
The advantage of this is that the InfiniteDict deals with most of the nasty recursive stuff, and the only thing left to do is make paths out of the strings, if necessary.
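One possible follow-up, given that the original goal was a .mat file: collapse the nested defaultdicts back into plain dicts before handing them to scipy.io.savemat, which expects ordinary mappings. This helper is a sketch (the name to_plain is mine, not part of the answer); note also that MATLAB struct field names must start with a letter, so date keys such as "2018" may still need renaming.

```python
def to_plain(d):
    # recursively copy any mapping into a plain dict; return leaves untouched
    if isinstance(d, dict):
        return {k: to_plain(v) for k, v in d.items()}
    return d
```

After that, something like scipy.io.savemat("out.mat", to_plain(parse(mydict))) should write the result, assuming all keys are valid MATLAB identifiers.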
-
Linus: Great solution, it works perfectly for converting the pickle file to a .mat file. I don't understand the decision to use this strange format, since it can't easily be imported into MATLAB or similar, and the person responsible simply said "you have to do the conversion yourself"... (Nov 26, 2019)
-
Graipher: @Linus Well, it is a slightly compressed version compared to JSON. But the gains are not much, especially if you have to parse it manually... (Nov 26, 2019)
-
Linus: Yeah, the pickle files are about 140 MB, so the compression might be helpful in some way, but it's not an easy format to work with. Anyway, thank you for the time and the code review :) (Nov 26, 2019)
-
Is "a::b": {"c": 2}, "a::b::c": {"d": 3} a possibility, and how would you deal with it?