I am using an API that returns a JSON object that has "encoded_polyline"
fields that tend to result in errors like this one:
UnicodeEncodeError: 'utf-8' codec can't encode character '\udb6e' in position 2: surrogates not allowed
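
The error can be reproduced in isolation. The following is a minimal sketch, assuming Python 3; the string `"ab\udb6e"` and the JSON payload are made up for illustration, and the idea that the API sends an escaped lone surrogate is an assumption, not something confirmed here.

```python
import json

# A lone surrogate such as '\udb6e' can live inside a Python str,
# but it has no valid UTF-8 encoding.
text = "ab\udb6e"

try:
    text.encode("utf-8")
    encode_failed = False
except UnicodeEncodeError as exc:
    encode_failed = True
    reason = exc.reason    # human-readable cause of the failure
    position = exc.start   # index of the offending character

# json.loads accepts an escaped lone surrogate, which is one way such a
# character could end up in decoded data (hypothetical payload).
decoded = json.loads('{"small_encoded_polyline": "ab\\udb6e"}')
```

The failure only appears when the string is encoded (for example when it is written out or stored), not when the JSON is parsed.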
Since the offending fields don't contain useful data, I decided the best approach would be to remove them. However, these fields are sometimes stored at the top level of the JSON object and sometimes stored within an array at one of:
jsonobject["points"]["points"]
jsonobject["laps"]["metric"]
jsonobject["laps"]["imperial"]
I think I've finally got something working to hunt down all these fields and delete them, but I have a feeling this could be done much more cleanly.

    r = requests.get(url, headers = {'User-Agent':UA})
    jsonobject = r.json()
    if 'laps' in jsonobject and jsonobject['laps'] is not None:
        if 'imperial' in jsonobject['laps']:
            laps_array = jsonobject['laps']['imperial']
            type = 'imperial'
        if 'metric' in jsonobject['laps']:
            laps_array = jsonobject['laps']['metric']
            type = 'metric'
        if laps_array is not None:
            jsonobject['laps_correct'] = dict()
            jsonobject['laps_correct'][type] = list()
            for document in laps_array:
                #sometimes these document objects turn out to be dicts
                #and sometimes they turn out to be strings
                #even though JSON output is always the same
                #is there a better way to deal with this?
                if document.__class__.__name__ == "dict":
                    document2 = document
                else:
                    document2 = json.loads(document)
                if 'small_encoded_polyline' in document2 and document2['small_encoded_polyline'] is not None:
                    del document2['small_encoded_polyline']
                document = document2
                #I thought this line above would modify the original
                #jsonobject since document is a dictionary so I should
                #be working with a pointer to the original object
                #but inspection of jsonobject reveals this not to be the case
                jsonobject['laps_correct'][type].append(document2)
            del jsonobject['laps']
            jsonobject['laps'] = jsonobject.pop('laps_correct')
    # this can't be an elif because sometimes json objects
    # have both "points" and "laps"
    if 'points' in jsonobject and jsonobject['points'] is not None:
        if 'points' in jsonobject['points']:
            laps_array = jsonobject['points']['points']
            jsonobject['points_correct'] = dict()
            jsonobject['points_correct']['points'] = list()
            if laps_array is not None:
                for document in laps_array:
                    if document.__class__.__name__ == "dict":
                        document2 = document
                    else:
                        document2 = json.loads(document)
                        if 'small_encoded_polyline' in document2 and document2['small_encoded_polyline'] is not None:
                            del document2['small_encoded_polyline']
                    document = document2
                    jsonobject['points_correct']['points'].append(document2)
                del jsonobject['points']
                jsonobject['points'] = jsonobject.pop('points_correct')
    if 'small_encoded_polyline' in jsonobject and jsonobject['small_encoded_polyline'] is not None:
        del jsonobject['small_encoded_polyline']

My two biggest worries/questions are:
- How can I deal with variables that are sometimes typed as dicts and sometimes as strings even though the JSON format appears identical in both cases?
- Is it really necessary to delete original key and replace it rather than updating dicts from the original key? This seems slow and clunky.
1 Answer
- I don't see why `del jsonobject[...]` would be necessary before reassigning that key. Also, the test in the last two lines could be simplified to something like `jsonobject.pop('small_encoded_polyline', None)` if you don't care about that key, no?
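
To illustrate both remarks, here is a small sketch using a made-up `record` dict (not the real API response): `dict.pop` with a default never raises, and assigning to an existing key simply overwrites it.

```python
# Hypothetical stand-in for the decoded JSON object.
record = {"id": 1, "small_encoded_polyline": "ab"}

# pop with a default removes the key if present and never raises KeyError.
removed = record.pop("small_encoded_polyline", None)
missing = record.pop("small_encoded_polyline", None)  # key is already gone

# Assigning to an existing key replaces the old value; no del is needed first.
record["laps"] = {"metric": [1]}
record["laps"] = {"metric": [2]}
```

One difference from the original test: `pop` also removes the key when its value is `None`, whereas the `is not None` check leaves such a key in place.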
Also:
- Everything could use more error checking. That is, check expected types and values instead of falling through the `if`s (`'imperial'`, `'metric'`: what happens if it's neither, or both?).
- Unless you really prefer `list` and `dict`, it's better to use the literal syntax `[]` and `{}` (the names could be shadowed by some definition).
- The type check for `dict`s can be better, i.e. `isinstance(foo, dict)`; if you really want only `dict`, then it's still better to use `type(foo) is dict` instead of comparing strings. I'd also move that whole `if/else` into a function, e.g. `maybe_load_json` or so.
- The value of `document` after the assignments isn't used anywhere, so it's safe to remove, though I'd rather use that variable again instead of introducing `document2`.
- The second branch with `'small_encoded_polyline'` seems to have the wrong level of indentation? At least it's different from the first one in that it's only run if the document wasn't a `dict`. Assuming that it's safe to run anyway, I'll change that in the code below.
- The pattern `if foo in json and json[foo] is not None:` could more easily be written as `if json.get(foo) is not None:`, as the return value there defaults to `None`.
- I'm not particularly fond of the habit of reusing input data, as is done here with `'points_correct'`. It would be cleaner just to have a list `points_correct` and assign that to a `jsonobject` key (and, you know, possibly put the calculation into a separate function as well).
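
A few of the points in this list can be checked in isolation. The sketch below is illustrative only: `sample`, `documents`, and `raw` are made-up data, not the API's real output, and the last part is just one possible explanation for why the original object appeared unchanged.

```python
import json
from collections import OrderedDict

# Hypothetical sample data.
sample = {"laps": None, "points": {"points": []}}

# .get returns None for a missing key and for an explicit None value alike.
has_laps = sample.get("laps") is not None
has_points = sample.get("points") is not None
has_other = sample.get("other") is not None

# isinstance accepts dict subclasses; an exact type comparison does not.
ordered = OrderedDict(a=1)
accepts_subclass = isinstance(ordered, dict)
exact_type_only = type(ordered) is dict

# Mutating a dict taken from a list changes the list's element;
# rebinding the loop variable afterwards does not.
documents = [{"id": 1, "small_encoded_polyline": "x"}]
for document in documents:
    document.pop("small_encoded_polyline", None)  # in-place change
    document = {"replaced": True}                 # only rebinds the name

# If an element is a JSON string, json.loads builds a new dict, so
# changes to that dict never reach the list that holds the string.
raw = ['{"id": 2, "small_encoded_polyline": "y"}']
for item in raw:
    parsed = json.loads(item)
    parsed.pop("small_encoded_polyline", None)
```

So elements that are already dicts can be cleaned in place, while string elements have to be replaced in the list by their parsed and cleaned versions.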
I might have messed up some of the logic now, but I hope you get the idea:

    def maybe_load_json(document):
        if isinstance(document, dict):
            return document
        return json.loads(document)

    def cleaned_small_encoded_polylines(documents):
        result = []
        for document in documents:
            #sometimes these document objects turn out to be dicts
            #and sometimes they turn out to be strings
            #even though JSON output is always the same
            #is there a better way to deal with this?
            document = maybe_load_json(document)
            if document.get('small_encoded_polyline') is not None:
                del document['small_encoded_polyline']
            #I thought this line above would modify the original
            #jsonobject since document is a dictionary so I should
            #be working with a pointer to the original object
            #but inspection of jsonobject reveals this not to be the case
            result.append(document)
        return result

    ...

    r = requests.get(url, headers = {'User-Agent':UA})
    jsonobject = r.json()

    if jsonobject.get('laps') is not None:
        if 'imperial' in jsonobject['laps']:
            laps_array = jsonobject['laps']['imperial']
            type = 'imperial'
        if 'metric' in jsonobject['laps']:
            laps_array = jsonobject['laps']['metric']
            type = 'metric'
        if laps_array is not None:
            jsonobject['laps'][type] = cleaned_small_encoded_polylines(laps_array)

    # this can't be an elif because sometimes json objects
    # have both "points" and "laps"
    if jsonobject.get('points') is not None:
        if 'points' in jsonobject['points']:
            points_correct = []
            laps_array = jsonobject['points']['points']
            if laps_array is not None:
                jsonobject['points']['points'] = cleaned_small_encoded_polylines(laps_array)

    jsonobject.pop('small_encoded_polyline', None)

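
As an alternative to listing every known location, the key could be removed wherever it occurs. This is only a sketch under assumptions: `strip_key` is a hypothetical helper, the sample `data` is invented, and treating any string that parses as a JSON object or array as an embedded document is a guess about the API's behaviour that may be too aggressive for other string fields.

```python
import json

KEY = "small_encoded_polyline"

def strip_key(node, key=KEY):
    """Return a copy of node with every occurrence of key removed, at any depth."""
    if isinstance(node, str):
        # Assumption: some elements arrive as JSON-encoded strings.
        try:
            parsed = json.loads(node)
        except ValueError:
            return node  # ordinary string, keep unchanged
        if isinstance(parsed, (dict, list)):
            return strip_key(parsed, key)
        return node  # e.g. "123" stays a string
    if isinstance(node, dict):
        return {k: strip_key(v, key) for k, v in node.items() if k != key}
    if isinstance(node, list):
        return [strip_key(item, key) for item in node]
    return node  # numbers, booleans, None

# Made-up data covering the locations mentioned in the question.
data = {
    "small_encoded_polyline": "a",
    "laps": {
        "metric": [
            {"id": 1, "small_encoded_polyline": "b"},
            '{"id": 2, "small_encoded_polyline": "c"}',
        ],
        "imperial": None,
    },
    "points": {"points": [{"id": 3, "small_encoded_polyline": None}]},
    "name": "run",
}
cleaned = strip_key(data)
```

Unlike the code above, this builds a new structure instead of modifying `jsonobject` in place, handles `'imperial'` and `'metric'` being both present or both absent, and drops the key even when its value is `None`.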
- Agreed on both points. (sunny, Jun 16, 2015 at 16:51)
- Thanks a lot for your expanded answer. I agree with most of what you say, but this comment confused me: "Unless you really prefer list and dict, it's better to use the literal syntax (fast, can't override it)." I actually don't know what you mean by that even after looking at your modifications to the code. (sunny, Jun 17, 2015 at 13:31)
- @sunny Ah sorry, I mean use `[]` and `{}` instead of `list()` and `dict()` to create an empty list and dictionary. (ferada, Jun 17, 2015 at 14:17)
- Many thanks, so this is good for me to learn the lingo. (sunny, Jun 17, 2015 at 14:18)
- [...] `r` come from? I suspect that that is the root cause of the problem.
- [...] `r.headers['content-type']` say?