Critique my JSON-optimizing Python script

Question 1

Here's a Python script to optimize JSON documents:

#!/usr/bin/python3
"""jopo - Optimize JSON documents."""
from collections import defaultdict
def optimize(data):
 """Optimize a JSON-like data structure."""
 if isinstance(data, list):
 if len(data) == 1:
 return optimize(data[0])
 digest = defaultdict(lambda: defaultdict(set))
 for element in data:
 if not isinstance(element, dict) or len(element) != 2:
 break
 for key, value in element.items():
 digest[key][type(value)].add(repr(value))
 else:
 if len(digest) == 2:
 pool = {k for k in digest if len(digest[k][str]) == len(data)}
 if pool:
 if len(pool) > 1:
 meta = input("Which key (%s)? " % "|".join(pool))
 else:
 meta = next(iter(pool))
 infra = (digest.keys() - {meta}).pop()
 return {d[meta]: optimize(d[infra]) for d in data}
 return [optimize(x) for x in data]
 if isinstance(data, dict):
 if len(data) == 1:
 return optimize(next(iter(data.values())))
 else:
 return {k: optimize(v) for k, v in data.items()}
 return data
if __name__ == "__main__":
 import json
 import sys
 for arg in sys.argv[1:]:
 with open(arg) as f:
 doc = json.load(f)
 print(json.dumps(optimize(doc), indent=2))

... which, given a JSON document recipes.json like this:

{
 "recipes": [
 {
 "name": "pizza", 
 "ingredients": [
 {
 "quantity": "100g", 
 "ingredient": "cheese"
 }, 
 {
 "quantity": "200g", 
 "ingredient": "tomato"
 }
 ]
 }, 
 {
 "name": "pizza 2", 
 "ingredients": [
 {
 "quantity": "300g", 
 "ingredient": "ham"
 }, 
 {
 "quantity": "300g", 
 "ingredient": "pineapple"
 }
 ]
 }
 ]
}

... is used like this:

$ ./jopo.py recipes.json 
Which key (ingredient|quantity)? ingredient
{
 "pizza 2": {
 "ham": "300g", 
 "pineapple": "300g"
 }, 
 "pizza": {
 "tomato": "200g", 
 "cheese": "100g"
 }
}

It does two things:

Collapse 1-item objects and arrays.
Reduce arrays of congruent 2-item objects to a single object, if possible.

I think the optimize() function is both too long and too hard to follow.

Obviously I could separate out the if isinstance(data, <type>): # ... blocks into separate functions, but even then, an optimize_list() function along those lines is still going to be quite long and, IMO, somewhat impenetrable.

How can I improve the legibility of this code?

Question 2

 digest = defaultdict(lambda: defaultdict(set))

It's pretty rare that we have a good reason for such a nested data structure in python.

 for element in data:
 if not isinstance(element, dict) or len(element) != 2:
 break
 for key, value in element.items():
 digest[key][type(value)].add(repr(value))

Why are we adding the repr of the value? I assume you do it to avoid taking the hash of unhashable objects. But what if somebody puts some strange object in your list that has a repr which always returns the same thing. Besides you are only interested in strings.

 else:
 if len(digest) == 2:

This is an unintuive way to check for your requirement. Your requirement is actually that all elements have the same keys. Check for that instead.

 pool = {k for k in digest if len(digest[k][str]) == len(data)}

Your actually checking to see if that key was always a string. Perhaps you should check that more directly.

 if pool:
 if len(pool) > 1:
 meta = input("Which key (%s)? " % "|".join(pool))

Its deeply suspicious if you need to ask the user for instructions.

 else:
 meta = next(iter(pool))
 infra = (digest.keys() - {meta}).pop()
 return {d[meta]: optimize(d[infra]) for d in data}
 return [optimize(x) for x in data]

Here is my reworking of it.

def optimize(data):
 """
 Dispatcher, figure out which algorithm to use to compress the json
 """
 if isinstance(data, list):
 if len(data) == 1:
 return optimize(data[0])
 else:
 keys = determine_key(data)
 if keys is None: # unable to determine tkeys
 return [optimize(element) for element in data]
 else:
 return optimize_list(data, keys)
 elif isinstance(data, dict):
 if len(data) == 1:
 return optimize(next(iter(data.values())))
 else:
 return optimize_dict(data)
 else:
 return data
def optimize_dict(data):
 """
 Optimize a dict, just optimizes each value
 """
 return {key : optimize(value) for key, value in data.items()}
def can_be_key(data, key):
 """
 return true if key could be a key
 """
 if not all( isinstance(element[key], str) for element in data):
 return False
 return len(set(element[key] for element in data)) == len(data)
def determine_key(data):
 """
 Determine the key, value to compress the list of dicts
 or None if no such key exists
 """
 for element in data:
 if not isinstance(element, dict):
 return None
 if len(element) != 2:
 return None
 if element.keys() != data[0].keys():
 return None
 key1, key2 = data[0].keys()
 key1_possible = can_be_key(data, key1)
 key2_possible = can_be_key(data, key2)
 if key1_possible and key2_possible:
 meta = input("Which key (%s|%s)? " % (key1, key2))
 if meta == key1:
 return key1, key2
 else:
 return key2, key1
 elif key1_possible:
 return key1, key2
 elif key2_possible:
 return key2, key1
 else:
 return None
def optimize_list(data, keys):
 key_key, key_value = keys
 return {element[key_key]: optimize(element[key_value]) for element in data}

But... I must question the wisdom of the entire approach. The function is actually rather useless. Suppose that you use it on web service that returns a json object. It gives you

{"results" : [
 {"name" : "foo", "age": 42}
 {"name" : "bar", "age": 36}
}

Which optimize to

{"foo": 42, "age": 36}

So you write:

for name, age in optimize(the_json).items():
 print(name, age)

But then in another request, the server returns

{"results" : [
 {"name" : "foo", "age": 42}
}

which optimizes to:

42

So when you run:

for name, age in optimize(the_json).items():
 print(name, age)

You get an error.

Or the server adds extra data to the json:

{
 "salinity" : "high",
 "results" : [
 {"name" : "foo", "age": 42, "status" : "sad"}
 {"name" : "bar", "age": 36, "status" : "happy"}
}

Which should be backwards compatible using json, but will break your code or anything doing the same job.

This is not to say that can't or shouldn't simplify the json. But don't try to do it by analyzing the data like this. You should do it by some external record of the schema.

class Fetch(object):
 """
 Fetch one element from a dictionary
 """
 def __init__(self, name, inner):
 self._name = name
 self._inner = inner
 def __call__(self, data):
 return self._inner(data[self._name])
class KeyValue(object):
 """
 Given a list of dicts, convert to a dictionary
 with each key being taken from the `key` on each element
 and each value being taken from the `value`
 """
 def __init__(self, key, value, inner):
 self._key = key
 self._value = value
 self._inner = inner
 def __call__(self, data):
 return {element[self._key] : self._inner(element[self._value]) for element in data}
class Literal(object):
 def __call__(self, data):
 return data
schema = Fetch("recipes",
 KeyValue("name", "ingredients",
 KeyValue("ingredient", "quantity", Literal())
 )
)

That's actually simpler then what you were doing, and its much more robust. Its also very flexible as you can write code to perform just about any transformation on the data.

Question 3

Thanks for the detailed critique ... I'll take some time to give it the attention it deserves, but for the moment, re: "I must question the wisdom of the entire approach" - it's not intended to work with live data (as you say, that would be horribly fragile), but as an aid in analysing the design of a JSON document. The idea came from this SO question, and the realisation that I could get a machine to replicate the rules I'd applied in answering it.

Question 4

On your criticisms: I agree the nested defaultdict is less readable than it could be. I'm using it to check complex-ish criteria in as few passes as possible; perhaps I shouldn't. Yes, the repr is for hashability. I'm unaware of any two distinct JSON fragments with the same repr; do you know of any? My requirement is that all objects in the array have the same two keys, so that the array can be converted to an object. I'm checking that the value is always a unique string, so that it can be a key. User input is necessary if both potential keys all have unique string values.

Question 5

Given the purpose of the code - suggest improvements to an implicit schema from example data - I can't really use a nonexistent schema to optimize against. My apologies for not being clearer about that in my question.

Question 6

And ... it turns out I can't put linebreaks in comments. Further apologies for the mess that's made of my response. If code review is a dialogue, it's kinda tricky in the SO format, isn't it?

Question 7

No two JSON fragments will never have the same repr. But somebody could pass something that wasn't json into your code. Its probably not a serious thing to be concerned about. But its a rather odd use of repr.

Winston Ewert Winston Ewert 30.7k4 gold badges52 silver badges79 bronze badges · Accepted Answer · 2013-08-09 03:36:44Z

 digest = defaultdict(lambda: defaultdict(set))

It's pretty rare that we have a good reason for such a nested data structure in python.

 for element in data:
 if not isinstance(element, dict) or len(element) != 2:
 break
 for key, value in element.items():
 digest[key][type(value)].add(repr(value))

Why are we adding the repr of the value? I assume you do it to avoid taking the hash of unhashable objects. But what if somebody puts some strange object in your list that has a repr which always returns the same thing. Besides you are only interested in strings.

 else:
 if len(digest) == 2:

This is an unintuive way to check for your requirement. Your requirement is actually that all elements have the same keys. Check for that instead.

 pool = {k for k in digest if len(digest[k][str]) == len(data)}

Your actually checking to see if that key was always a string. Perhaps you should check that more directly.

 if pool:
 if len(pool) > 1:
 meta = input("Which key (%s)? " % "|".join(pool))

Its deeply suspicious if you need to ask the user for instructions.

 else:
 meta = next(iter(pool))
 infra = (digest.keys() - {meta}).pop()
 return {d[meta]: optimize(d[infra]) for d in data}
 return [optimize(x) for x in data]

Here is my reworking of it.

def optimize(data):
 """
 Dispatcher, figure out which algorithm to use to compress the json
 """
 if isinstance(data, list):
 if len(data) == 1:
 return optimize(data[0])
 else:
 keys = determine_key(data)
 if keys is None: # unable to determine tkeys
 return [optimize(element) for element in data]
 else:
 return optimize_list(data, keys)
 elif isinstance(data, dict):
 if len(data) == 1:
 return optimize(next(iter(data.values())))
 else:
 return optimize_dict(data)
 else:
 return data
def optimize_dict(data):
 """
 Optimize a dict, just optimizes each value
 """
 return {key : optimize(value) for key, value in data.items()}
def can_be_key(data, key):
 """
 return true if key could be a key
 """
 if not all( isinstance(element[key], str) for element in data):
 return False
 return len(set(element[key] for element in data)) == len(data)
def determine_key(data):
 """
 Determine the key, value to compress the list of dicts
 or None if no such key exists
 """
 for element in data:
 if not isinstance(element, dict):
 return None
 if len(element) != 2:
 return None
 if element.keys() != data[0].keys():
 return None
 key1, key2 = data[0].keys()
 key1_possible = can_be_key(data, key1)
 key2_possible = can_be_key(data, key2)
 if key1_possible and key2_possible:
 meta = input("Which key (%s|%s)? " % (key1, key2))
 if meta == key1:
 return key1, key2
 else:
 return key2, key1
 elif key1_possible:
 return key1, key2
 elif key2_possible:
 return key2, key1
 else:
 return None
def optimize_list(data, keys):
 key_key, key_value = keys
 return {element[key_key]: optimize(element[key_value]) for element in data}

But... I must question the wisdom of the entire approach. The function is actually rather useless. Suppose that you use it on web service that returns a json object. It gives you

{"results" : [
 {"name" : "foo", "age": 42}
 {"name" : "bar", "age": 36}
}

Which optimize to

{"foo": 42, "age": 36}

So you write:

for name, age in optimize(the_json).items():
 print(name, age)

But then in another request, the server returns

{"results" : [
 {"name" : "foo", "age": 42}
}

which optimizes to:

42

So when you run:

for name, age in optimize(the_json).items():
 print(name, age)

You get an error.

Or the server adds extra data to the json:

{
 "salinity" : "high",
 "results" : [
 {"name" : "foo", "age": 42, "status" : "sad"}
 {"name" : "bar", "age": 36, "status" : "happy"}
}

Which should be backwards compatible using json, but will break your code or anything doing the same job.

This is not to say that can't or shouldn't simplify the json. But don't try to do it by analyzing the data like this. You should do it by some external record of the schema.

class Fetch(object):
 """
 Fetch one element from a dictionary
 """
 def __init__(self, name, inner):
 self._name = name
 self._inner = inner
 def __call__(self, data):
 return self._inner(data[self._name])
class KeyValue(object):
 """
 Given a list of dicts, convert to a dictionary
 with each key being taken from the `key` on each element
 and each value being taken from the `value`
 """
 def __init__(self, key, value, inner):
 self._key = key
 self._value = value
 self._inner = inner
 def __call__(self, data):
 return {element[self._key] : self._inner(element[self._value]) for element in data}
class Literal(object):
 def __call__(self, data):
 return data
schema = Fetch("recipes",
 KeyValue("name", "ingredients",
 KeyValue("ingredient", "quantity", Literal())
 )
)

That's actually simpler then what you were doing, and its much more robust. Its also very flexible as you can write code to perform just about any transformation on the data.

Thanks for the detailed critique ... I'll take some time to give it the attention it deserves, but for the moment, re: "I must question the wisdom of the entire approach" - it's not intended to work with live data (as you say, that would be horribly fragile), but as an aid in analysing the design of a JSON document. The idea came from this SO question, and the realisation that I could get a machine to replicate the rules I'd applied in answering it.
On your criticisms: I agree the nested defaultdict is less readable than it could be. I'm using it to check complex-ish criteria in as few passes as possible; perhaps I shouldn't. Yes, the repr is for hashability. I'm unaware of any two distinct JSON fragments with the same repr; do you know of any? My requirement is that all objects in the array have the same two keys, so that the array can be converted to an object. I'm checking that the value is always a unique string, so that it can be a key. User input is necessary if both potential keys all have unique string values.
Given the purpose of the code - suggest improvements to an implicit schema from example data - I can't really use a nonexistent schema to optimize against. My apologies for not being clearer about that in my question.
And ... it turns out I can't put linebreaks in comments. Further apologies for the mess that's made of my response. If code review is a dialogue, it's kinda tricky in the SO format, isn't it?
No two JSON fragments will never have the same repr. But somebody could pass something that wasn't json into your code. Its probably not a serious thing to be concerned about. But its a rather odd use of repr.

Stack Exchange Network

Critique my JSON-optimizing Python script

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Critique my JSON-optimizing Python script

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions