Dealing with messy JSON API and UTF-8 encoding problems

Question 1

I am using an API that returns a JSON object that has "encoded_polyline" fields that tend to result in errors like this one:

UnicodeEncodeError: 'utf-8' codec can't encode character '\udb6e' in position 2: surrogates not allowed
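
For reference, this kind of error can be reproduced without the API. The snippet below is only a minimal sketch: it assumes the field arrives as a JSON escape for an unpaired surrogate, and the sample string is made up for illustration.

```python
import json

# Hypothetical sample: a JSON string containing an escaped lone surrogate.
raw = '"ab\\udb6e"'

# json.loads accepts the escape and yields a str holding the lone surrogate.
value = json.loads(raw)

try:
    value.encode('utf-8')
    strict_failed = False
except UnicodeEncodeError:
    # Strict UTF-8 encoding refuses unpaired surrogates.
    strict_failed = True

# An explicit error handler substitutes the offending character instead.
replaced = value.encode('utf-8', errors='replace')
```

If the text itself is not needed, an error handler such as errors='replace' is one way to avoid the exception at the point where the string is encoded.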

Since the offending fields don't offer useful data, I decided the best approach would be to remove them. However, these fields are sometimes stored at the top level of the JSON object and sometimes within an array at one of:

jsonobject["points"]["points"]
jsonobject["laps"]["metric"]
jsonobject["laps"]["imperial"]

I think I've finally got something working to hunt down all these fields and delete them, but I have a feeling this could be done much more cleanly.

r = requests.get(url, headers = {'User-Agent':UA})
jsonobject = r.json()
if 'laps' in jsonobject and jsonobject['laps'] is not None:
    if 'imperial' in jsonobject['laps']:
        laps_array = jsonobject['laps']['imperial']
        type = 'imperial'
    if 'metric' in jsonobject['laps']:
        laps_array = jsonobject['laps']['metric']
        type = 'metric'
    if laps_array is not None:
        jsonobject['laps_correct'] = dict()
        jsonobject['laps_correct'][type] = list()
        for document in laps_array:
            #sometimes these document objects turn out to be dicts
            #and sometimes they turn out to be strings
            #even though JSON output is always the same
            #is there a better way to deal with this?
            if document.__class__.__name__ == "dict":
                document2 = document
            else:
                document2 = json.loads(document)
            if 'small_encoded_polyline' in document2 and document2['small_encoded_polyline'] is not None:
                del document2['small_encoded_polyline']
            document = document2
            #I thought this line above would modify the original
            #jsonobject since document is a dictionary so I should
            #be working with a pointer to the original object
            #but inspection of jsonobject reveals this not to be the case
            jsonobject['laps_correct'][type].append(document2)
        del jsonobject['laps']
        jsonobject['laps'] = jsonobject.pop('laps_correct')
# this can't be an elif because sometimes json objects
# have both "points" and "laps"
if 'points' in jsonobject and jsonobject['points'] is not None:
    if 'points' in jsonobject['points']:
        laps_array = jsonobject['points']['points']
        jsonobject['points_correct'] = dict()
        jsonobject['points_correct']['points'] = list()
        if laps_array is not None:
            for document in laps_array:
                if document.__class__.__name__ == "dict":
                    document2 = document
                else:
                    document2 = json.loads(document)
                    if 'small_encoded_polyline' in document2 and document2['small_encoded_polyline'] is not None:
                        del document2['small_encoded_polyline']
                document = document2
                jsonobject['points_correct']['points'].append(document2)
        del jsonobject['points']
        jsonobject['points'] = jsonobject.pop('points_correct')
if 'small_encoded_polyline' in jsonobject and jsonobject['small_encoded_polyline'] is not None:
    del jsonobject['small_encoded_polyline']

My two biggest worries/questions are:

How can I deal with variables that are sometimes typed as dicts and sometimes as strings even though the JSON format appears identical in both cases?
Is it really necessary to delete original key and replace it rather than updating dicts from the original key? This seems slow and clunky.
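
To illustrate the second question: the sketch below uses made-up data (not the real API payload) to show why assigning to the loop variable does not change the list. A dict element is mutated in place, but a string element has to be replaced by index because json.loads returns a new object.

```python
import json

# Made-up sample data: one item is already a dict, the other is a JSON string.
laps = [
    {"id": 1, "small_encoded_polyline": "x"},
    '{"id": 2, "small_encoded_polyline": "y"}',
]

for document in laps:
    if isinstance(document, dict):
        document2 = document              # same object as the list element
    else:
        document2 = json.loads(document)  # brand-new dict, unrelated to the list
    document2.pop("small_encoded_polyline", None)
    document = document2                  # rebinds the loop variable only

# The dict element was changed in place; the string element is untouched.
first_cleaned = "small_encoded_polyline" not in laps[0]
second_unchanged = isinstance(laps[1], str)

# Assigning by index is what actually replaces an element of the list.
for i, document in enumerate(laps):
    if isinstance(document, str):
        laps[i] = json.loads(document)
    laps[i].pop("small_encoded_polyline", None)
```

After the second loop every element is a dict without the unwanted key, so no separate 'laps_correct' copy is needed.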

Question 2

Is it not possible with the API you're using to manually fix up the encoding before parsing it as JSON? (Like in this SO post; admittedly I'm not sure if that is exactly your problem.)
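
A rough sketch of what decoding manually could look like, assuming the body is really UTF-8. The helper name parse_json_bytes is hypothetical, and the sample bytes stand in for r.content from the question. Whether this resolves the surrogate error depends on what the API actually sends.

```python
import json

def parse_json_bytes(raw, encoding='utf-8'):
    """Decode the raw response body ourselves, then parse it as JSON."""
    return json.loads(raw.decode(encoding))

# Stand-in for r.content; these bytes are made up for illustration.
body = '{"name": "café"}'.encode('utf-8')
parsed = parse_json_bytes(body)
```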

Question 3

Where does r come from? I suspect that that is the root cause of the problem.

Question 4

@200_success r is the Response object returned by requests.get (docs.python-requests.org/en/latest)

Question 5

What does r.headers['content-type'] say?

Question 6

@200_success type is: text/html;charset=ISO-8859-1
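
The declared charset matters because it decides how the response bytes are turned into text. The following sketch uses made-up sample text to show what happens when UTF-8 bytes are decoded as ISO-8859-1. Whether that is what happens with this API is an assumption, since the raw payload is not shown in the thread.

```python
# Made-up sample text; the real API payload is not shown in the thread.
original = 'café'
utf8_bytes = original.encode('utf-8')

# Decoding UTF-8 bytes with the declared charset produces mojibake ...
misread = utf8_bytes.decode('iso-8859-1')

# ... whereas decoding with the actual encoding round-trips cleanly.
correct = utf8_bytes.decode('utf-8')
```

With requests, assigning r.encoding = 'utf-8' before reading r.text overrides the charset taken from the header.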

Question 7

I don't see why del jsonobject[...] would be necessary before reassigning that key. Also, the test in the last two lines could also be simplified to something like jsonobject.pop('small_encoded_polyline', None) if you don't care about that key, no?

Also:

Everything could use more error checking. That is, check expected types and values instead of falling through the ifs ('imperial', 'metric', what happens if it's neither, or both?).
Unless you really prefer list and dict, it's better to use the literal syntax [] and {} (the names could be shadowed by some definition).
The type check for dicts can be better, i.e. isinstance(foo, dict); if you really want only dict, then it's still better to use type(foo) is dict instead of comparing strings. I'd also move that whole if/else into a function, e.g. maybe_load_json or so.
The value of document after the assignments isn't used anywhere, so it's safe to remove, though I'd rather use that variable again instead of introducing document2.
The second branch with 'small_encoded_polyline' seems to have the wrong level of indentation? At least it's different from the first one in that it's only run if the document wasn't a dict. Assuming that it's safe to run anyway I'll change that in the code below.
The pattern if foo in json and json[foo] is not None: can be written more simply as if json.get(foo) is not None:, since get returns None by default when the key is missing.
I'm not particularly fond of the habit of reusing input data as is done here with 'points_correct'. It would be cleaner to build a separate list points_correct and assign that to a jsonobject key (and, you know, possibly put the calculation into a separate function as well).
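
A few of the points above in isolation, as a small sketch with made-up data (the record contents are only for illustration):

```python
from collections import OrderedDict

record = {'small_encoded_polyline': None, 'distance': 5}

# dict.get returns None for a missing key, so one test covers both cases.
has_polyline = record.get('small_encoded_polyline') is not None
has_distance = record.get('distance') is not None

# pop with a default removes the key if present and never raises KeyError.
record.pop('small_encoded_polyline', None)
record.pop('not_there', None)

# isinstance also accepts dict subclasses; the class-name comparison does not.
ordered = OrderedDict(a=1)
by_name = ordered.__class__.__name__ == 'dict'
by_isinstance = isinstance(ordered, dict)
```

Note that pop removes the key even when its value is None, which differs slightly from the original test but is harmless if the key is unwanted anyway.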

I might have messed up some of the logic now, but I hope you get the idea:

import json

import requests

def maybe_load_json(document):
    if isinstance(document, dict):
        return document
    return json.loads(document)

def cleaned_small_encoded_polylines(documents):
    result = []
    for document in documents:
        #sometimes these document objects turn out to be dicts
        #and sometimes they turn out to be strings
        #even though JSON output is always the same
        #is there a better way to deal with this?
        document = maybe_load_json(document)
        if document.get('small_encoded_polyline') is not None:
            del document['small_encoded_polyline']
        #I thought this line above would modify the original
        #jsonobject since document is a dictionary so I should
        #be working with a pointer to the original object
        #but inspection of jsonobject reveals this not to be the case
        result.append(document)
    return result
...
    r = requests.get(url, headers = {'User-Agent':UA})
    jsonobject = r.json()
    if jsonobject.get('laps') is not None:
        if 'imperial' in jsonobject['laps']:
            laps_array = jsonobject['laps']['imperial']
            type = 'imperial'
        if 'metric' in jsonobject['laps']:
            laps_array = jsonobject['laps']['metric']
            type = 'metric'
        if laps_array is not None:
            jsonobject['laps'][type] = cleaned_small_encoded_polylines(laps_array)
    # this can't be an elif because sometimes json objects
    # have both "points" and "laps"
    if jsonobject.get('points') is not None:
        if 'points' in jsonobject['points']:
            points_correct = []
            laps_array = jsonobject['points']['points']
            if laps_array is not None:
                jsonobject['points']['points'] = cleaned_small_encoded_polylines(laps_array)
    jsonobject.pop('small_encoded_polyline', None)
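
If the exact location of the key does not matter, a recursive walk is another option. This is only a sketch and not part of the code above: the name strip_key is hypothetical, the payload is made up to resemble the paths listed in the question, and it assumes the nested documents are already dicts rather than JSON strings.

```python
def strip_key(node, key='small_encoded_polyline'):
    """Return a copy of node with every occurrence of key removed."""
    if isinstance(node, dict):
        return {k: strip_key(v, key) for k, v in node.items() if k != key}
    if isinstance(node, list):
        return [strip_key(item, key) for item in node]
    return node

# Made-up payload shaped like the paths listed in the question.
payload = {
    'small_encoded_polyline': 'a',
    'laps': {'metric': [{'id': 1, 'small_encoded_polyline': 'b'}]},
    'points': {'points': [{'id': 2, 'small_encoded_polyline': 'c'}]},
}
cleaned = strip_key(payload)
```

Because it builds a new structure, the input is left untouched, and the 'imperial'/'metric' question (neither, or both) no longer needs special handling.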

Question 8

Agreed on both points.

Question 9

Thanks a lot for your expanded answer. I agree with most of what you say, but this comment confused me: "Unless you really prefer list and dict, it's better to use the literal syntax (fast, can't override it)." I don't know what you mean by that, even after looking at your modifications to the code.

Question 10

@sunny Ah sorry, I mean use [] and {} instead of list() and dict() to create an empty list and dictionary.
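
A small sketch of the shadowing issue, with hypothetical function names: list() is looked up by name when the call runs, so a local definition can change what it returns, while the literal [] always builds a list.

```python
def build_with_call():
    list = tuple      # a careless local definition shadows the builtin name
    return list()     # now returns an empty tuple, not a list

def build_with_literal():
    list = tuple      # the same shadowing has no effect on the literal
    return []
```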

Question 11

Many thanks; this is good for me to learn the lingo.

ferada · Accepted Answer · 2015-06-16 16:41:54Z

