Here's a Python script to optimize JSON documents:
#!/usr/bin/python3
"""jopo - Optimize JSON documents."""
from collections import defaultdict
def optimize(data):
"""Optimize a JSON-like data structure."""
if isinstance(data, list):
if len(data) == 1:
return optimize(data[0])
digest = defaultdict(lambda: defaultdict(set))
for element in data:
if not isinstance(element, dict) or len(element) != 2:
break
for key, value in element.items():
digest[key][type(value)].add(repr(value))
else:
if len(digest) == 2:
pool = {k for k in digest if len(digest[k][str]) == len(data)}
if pool:
if len(pool) > 1:
meta = input("Which key (%s)? " % "|".join(pool))
else:
meta = next(iter(pool))
infra = (digest.keys() - {meta}).pop()
return {d[meta]: optimize(d[infra]) for d in data}
return [optimize(x) for x in data]
if isinstance(data, dict):
if len(data) == 1:
return optimize(next(iter(data.values())))
else:
return {k: optimize(v) for k, v in data.items()}
return data
if __name__ == "__main__":
import json
import sys
for arg in sys.argv[1:]:
with open(arg) as f:
doc = json.load(f)
print(json.dumps(optimize(doc), indent=2))
... which, given a JSON document recipes.json
like this:
{
"recipes": [
{
"name": "pizza",
"ingredients": [
{
"quantity": "100g",
"ingredient": "cheese"
},
{
"quantity": "200g",
"ingredient": "tomato"
}
]
},
{
"name": "pizza 2",
"ingredients": [
{
"quantity": "300g",
"ingredient": "ham"
},
{
"quantity": "300g",
"ingredient": "pineapple"
}
]
}
]
}
... is used like this:
$ ./jopo.py recipes.json
Which key (ingredient|quantity)? ingredient
{
"pizza 2": {
"ham": "300g",
"pineapple": "300g"
},
"pizza": {
"tomato": "200g",
"cheese": "100g"
}
}
It does two things:
Collapse 1-item objects and arrays.
Reduce arrays of congruent 2-item objects to a single object, if possible.
I think the optimize()
function is both too long and too hard to follow.
Obviously I could separate out the if isinstance(data, <type>): # ...
blocks into separate functions, but even then, an optimize_list()
function along those lines is still going to be quite long and, IMO, somewhat impenetrable.
How can I improve the legibility of this code?
1 Answer 1
digest = defaultdict(lambda: defaultdict(set))
It's pretty rare that we have a good reason for such a nested data structure in python.
for element in data:
if not isinstance(element, dict) or len(element) != 2:
break
for key, value in element.items():
digest[key][type(value)].add(repr(value))
Why are we adding the repr of the value? I assume you do it to avoid taking the hash of unhashable objects. But what if somebody puts some strange object in your list that has a repr which always returns the same thing. Besides you are only interested in strings.
else:
if len(digest) == 2:
This is an unintuive way to check for your requirement. Your requirement is actually that all elements have the same keys. Check for that instead.
pool = {k for k in digest if len(digest[k][str]) == len(data)}
Your actually checking to see if that key was always a string. Perhaps you should check that more directly.
if pool:
if len(pool) > 1:
meta = input("Which key (%s)? " % "|".join(pool))
Its deeply suspicious if you need to ask the user for instructions.
else:
meta = next(iter(pool))
infra = (digest.keys() - {meta}).pop()
return {d[meta]: optimize(d[infra]) for d in data}
return [optimize(x) for x in data]
Here is my reworking of it.
def optimize(data):
"""
Dispatcher, figure out which algorithm to use to compress the json
"""
if isinstance(data, list):
if len(data) == 1:
return optimize(data[0])
else:
keys = determine_key(data)
if keys is None: # unable to determine tkeys
return [optimize(element) for element in data]
else:
return optimize_list(data, keys)
elif isinstance(data, dict):
if len(data) == 1:
return optimize(next(iter(data.values())))
else:
return optimize_dict(data)
else:
return data
def optimize_dict(data):
"""
Optimize a dict, just optimizes each value
"""
return {key : optimize(value) for key, value in data.items()}
def can_be_key(data, key):
"""
return true if key could be a key
"""
if not all( isinstance(element[key], str) for element in data):
return False
return len(set(element[key] for element in data)) == len(data)
def determine_key(data):
"""
Determine the key, value to compress the list of dicts
or None if no such key exists
"""
for element in data:
if not isinstance(element, dict):
return None
if len(element) != 2:
return None
if element.keys() != data[0].keys():
return None
key1, key2 = data[0].keys()
key1_possible = can_be_key(data, key1)
key2_possible = can_be_key(data, key2)
if key1_possible and key2_possible:
meta = input("Which key (%s|%s)? " % (key1, key2))
if meta == key1:
return key1, key2
else:
return key2, key1
elif key1_possible:
return key1, key2
elif key2_possible:
return key2, key1
else:
return None
def optimize_list(data, keys):
key_key, key_value = keys
return {element[key_key]: optimize(element[key_value]) for element in data}
But... I must question the wisdom of the entire approach. The function is actually rather useless. Suppose that you use it on web service that returns a json object. It gives you
{"results" : [
{"name" : "foo", "age": 42}
{"name" : "bar", "age": 36}
}
Which optimize to
{"foo": 42, "age": 36}
So you write:
for name, age in optimize(the_json).items():
print(name, age)
But then in another request, the server returns
{"results" : [
{"name" : "foo", "age": 42}
}
which optimizes to:
42
So when you run:
for name, age in optimize(the_json).items():
print(name, age)
You get an error.
Or the server adds extra data to the json:
{
"salinity" : "high",
"results" : [
{"name" : "foo", "age": 42, "status" : "sad"}
{"name" : "bar", "age": 36, "status" : "happy"}
}
Which should be backwards compatible using json, but will break your code or anything doing the same job.
This is not to say that can't or shouldn't simplify the json. But don't try to do it by analyzing the data like this. You should do it by some external record of the schema.
class Fetch(object):
"""
Fetch one element from a dictionary
"""
def __init__(self, name, inner):
self._name = name
self._inner = inner
def __call__(self, data):
return self._inner(data[self._name])
class KeyValue(object):
"""
Given a list of dicts, convert to a dictionary
with each key being taken from the `key` on each element
and each value being taken from the `value`
"""
def __init__(self, key, value, inner):
self._key = key
self._value = value
self._inner = inner
def __call__(self, data):
return {element[self._key] : self._inner(element[self._value]) for element in data}
class Literal(object):
def __call__(self, data):
return data
schema = Fetch("recipes",
KeyValue("name", "ingredients",
KeyValue("ingredient", "quantity", Literal())
)
)
That's actually simpler then what you were doing, and its much more robust. Its also very flexible as you can write code to perform just about any transformation on the data.
-
\$\begingroup\$ Thanks for the detailed critique ... I'll take some time to give it the attention it deserves, but for the moment, re: "I must question the wisdom of the entire approach" - it's not intended to work with live data (as you say, that would be horribly fragile), but as an aid in analysing the design of a JSON document. The idea came from this SO question, and the realisation that I could get a machine to replicate the rules I'd applied in answering it. \$\endgroup\$user19131– user191312013年08月09日 11:30:28 +00:00Commented Aug 9, 2013 at 11:30
-
\$\begingroup\$ On your criticisms: I agree the nested defaultdict is less readable than it could be. I'm using it to check complex-ish criteria in as few passes as possible; perhaps I shouldn't. Yes, the
repr
is for hashability. I'm unaware of any two distinct JSON fragments with the samerepr
; do you know of any? My requirement is that all objects in the array have the same two keys, so that the array can be converted to an object. I'm checking that the value is always a unique string, so that it can be a key. User input is necessary if both potential keys all have unique string values. \$\endgroup\$user19131– user191312013年08月13日 15:51:48 +00:00Commented Aug 13, 2013 at 15:51 -
\$\begingroup\$ Given the purpose of the code - suggest improvements to an implicit schema from example data - I can't really use a nonexistent schema to optimize against. My apologies for not being clearer about that in my question. \$\endgroup\$user19131– user191312013年08月13日 15:52:31 +00:00Commented Aug 13, 2013 at 15:52
-
\$\begingroup\$ And ... it turns out I can't put linebreaks in comments. Further apologies for the mess that's made of my response. If code review is a dialogue, it's kinda tricky in the SO format, isn't it? \$\endgroup\$user19131– user191312013年08月13日 15:53:56 +00:00Commented Aug 13, 2013 at 15:53
-
\$\begingroup\$ No two JSON fragments will never have the same repr. But somebody could pass something that wasn't json into your code. Its probably not a serious thing to be concerned about. But its a rather odd use of repr. \$\endgroup\$Winston Ewert– Winston Ewert2013年08月13日 16:04:43 +00:00Commented Aug 13, 2013 at 16:04