When screen-scraping a website, I extract data from <script> tags.
The data I get is not in standard JSON format, so I cannot use json.loads().
# from
js_obj = '{x:1, y:2, z:3}'
# to
py_obj = {'x':1, 'y':2, 'z':3}
Currently, I use a regex to transform the raw data into JSON.
But this gets painful when I encounter complicated data structures.
Is there a better solution?
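To make the problem concrete, here is a minimal sketch of what json.loads() does with the unquoted-key input from the question. The variable name `error` is only for illustration.

```python
import json

js_obj = '{x:1, y:2, z:3}'

# Strict JSON requires double-quoted keys, so parsing this input fails.
try:
    json.loads(js_obj)
    error = None
except json.JSONDecodeError as exc:
    error = exc

print(type(error).__name__, error)
```

The exception carries the position of the first offending character, which can help when deciding what kind of cleanup the raw data needs.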
7 Answers
demjson.decode()
import demjson
# from
js_obj = '{x:1, y:2, z:3}'
# to
py_obj = demjson.decode(js_obj)
chompjs.parse_js_object()
import chompjs
# from
js_obj = '{x:1, y:2, z:3}'
# to
py_obj = chompjs.parse_js_object(js_obj)
jsonnet.evaluate_snippet()
import json, _jsonnet
# from
js_obj = '{x:1, y:2, z:3}'
# to
py_obj = json.loads(_jsonnet.evaluate_snippet('snippet', js_obj))
ast.literal_eval()
import ast
# from
js_obj = "{'x':1, 'y':2, 'z':3}"
# to
py_obj = ast.literal_eval(js_obj)
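If it is not known in advance which parser will accept a given snippet, one option is to try them in order, strictest first. The following is only a sketch: `parse_js_object` is a hypothetical helper name, and the third-party parsers (chompjs, json5) are used only when they happen to be installed.

```python
import ast
import json


def parse_js_object(text):
    """Try progressively more lenient parsers and return the first result."""
    parsers = [json.loads, ast.literal_eval]

    # Optional third-party parsers, appended only if they can be imported.
    try:
        import chompjs
        parsers.append(chompjs.parse_js_object)
    except ImportError:
        pass
    try:
        import json5
        parsers.append(json5.loads)
    except ImportError:
        pass

    errors = []
    for parse in parsers:
        try:
            return parse(text)
        except (ValueError, SyntaxError) as exc:
            name = getattr(parse, "__name__", repr(parse))
            errors.append(f"{name}: {exc}")
    raise ValueError("no parser accepted the input:\n" + "\n".join(errors))


print(parse_js_object('{"x": 1}'))         # handled by json.loads
print(parse_js_object("{'x':1, 'y':2}"))   # falls through to ast.literal_eval
```

With only the standard library available, input with unquoted keys such as '{x:1}' still raises ValueError; it needs one of the lenient parsers.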
Comments
With jsonnet I am getting: /home/hafiz031/anaconda3/envs/py38/lib/python3.8/site-packages/_jsonnet.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZSt28__throw_bad_array_new_lengthv
Use json5
import json5
js_obj = '{x:1, y:2, z:3}'
py_obj = json5.loads(js_obj)
print(py_obj)
# output
# {'x': 1, 'y': 2, 'z': 3}
If you have node available on the system, you can ask it to evaluate the javascript expression for you, and print the stringified result. The resulting JSON can then be fed to json.loads:
import json
from subprocess import PIPE, Popen

def evaluate_javascript(s):
    """Evaluate and stringify a javascript expression in node.js, and convert the
    resulting JSON to a Python object"""
    # 'node -' reads the script to run from stdin
    node = Popen(['node', '-'], stdin=PIPE, stdout=PIPE)
    stdout, _ = node.communicate(f'console.log(JSON.stringify({s}))'.encode('utf8'))
    return json.loads(stdout.decode('utf8'))
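A variation on the same idea, sketched with subprocess.run so that a missing node binary, a non-zero exit status, or a hung process surfaces as an exception. The `timeout` parameter and its default of 10 seconds are assumptions, not part of the answer above. Since the expression is executed as code by node, this is only suitable for input you trust.

```python
import json
import shutil
import subprocess


def evaluate_javascript(expr, timeout=10):
    """Evaluate a JavaScript expression in Node.js and return it as a Python object."""
    if shutil.which("node") is None:
        raise RuntimeError("node executable not found on PATH")
    # The extra parentheses make sure the whole input is treated as one expression.
    script = f"console.log(JSON.stringify(({expr})))"
    result = subprocess.run(
        ["node", "-"],        # 'node -' reads the script from stdin
        input=script,
        capture_output=True,
        text=True,
        timeout=timeout,      # raises subprocess.TimeoutExpired if node hangs
        check=True,           # raises CalledProcessError on a non-zero exit
    )
    return json.loads(result.stdout)
```

Note that JSON.stringify drops object properties whose value is undefined, so those keys will be absent from the resulting dictionary.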
This will likely not work everywhere, but as a start, here's a simple regex that should convert the keys into quoted strings so you can pass the result to json.loads. Or is this what you're already doing?
In[70] : quote_keys_regex = r'([\{\s,])(\w+)(:)'
In[71] : re.sub(quote_keys_regex, r'\1"\2"\3', js_obj)
Out[71]: '{"x":1, "y":2, "z":3}'
In[72] : js_obj_2 = '{x:1, y:2, z:{k:3,j:2}}'
In[73] : re.sub(quote_keys_regex, r'\1"\2"\3', js_obj_2)
Out[73]: '{"x":1, "y":2, "z":{"k":3,"j":2}}'
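To illustrate the "will likely not work everywhere" caveat, here is a small sketch that wraps the same regex in a function and feeds it one input it handles and one it does not. The helper name `quote_keys` and the sample strings are made up for the example.

```python
import json
import re

quote_keys_regex = r'([\{\s,])(\w+)(:)'


def quote_keys(js_text):
    """Wrap bare object keys in double quotes using the regex above."""
    return re.sub(quote_keys_regex, r'\1"\2"\3', js_text)


# Works: nested objects with simple values.
nested = json.loads(quote_keys('{x:1, y:2, z:{k:3,j:2}}'))

# Breaks: the string value contains " 10:", which looks like a key to the regex.
tricky = quote_keys('{t:"at 10:30"}')
try:
    json.loads(tricky)
    tricky_ok = True
except json.JSONDecodeError:
    tricky_ok = False

print(nested)
print(tricky, tricky_ok)
```

The regex has no notion of being inside a string literal, so any value containing whitespace or a comma followed by word characters and a colon gets rewritten too.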
Not including objects
json.loads()
json.loads() doesn't accept undefined, you have to change it to null
json.loads() only accepts double quotes
{"foo": 1, "bar": null}
Use this if you are sure that your JavaScript code only has double quotes around key names.
import json
import re
json_text = """{"foo": 1, "bar": undefined}"""
# replace undefined values with null; \1 and \2 keep the surrounding punctuation
json_text = re.sub(r'("\s*:\s*)undefined(\s*[,}])', r'\1null\2', json_text)
py_obj = json.loads(json_text)
ast.literal_eval()
ast.literal_eval() doesn't accept undefined, you have to change it to None
ast.literal_eval() doesn't accept null, you have to change it to None
ast.literal_eval() doesn't accept true, you have to change it to True
ast.literal_eval() doesn't accept false, you have to change it to False
ast.literal_eval() accepts single and double quotes
{"foo": 1, "bar": None} or {'foo': 1, 'bar': None}
import ast
import re
js_obj = """{'foo': 1, 'bar': undefined}"""
# map JavaScript literals to their Python equivalents; \1 and \2 keep the surrounding punctuation
js_obj = re.sub(r'([\'\"]\s*:\s*)undefined(\s*[,}])', r'\1None\2', js_obj)
js_obj = re.sub(r'([\'\"]\s*:\s*)null(\s*[,}])', r'\1None\2', js_obj)
js_obj = re.sub(r'([\'\"]\s*:\s*)NaN(\s*[,}])', r'\1None\2', js_obj)
js_obj = re.sub(r'([\'\"]\s*:\s*)true(\s*[,}])', r'\1True\2', js_obj)
js_obj = re.sub(r'([\'\"]\s*:\s*)false(\s*[,}])', r'\1False\2', js_obj)
py_obj = ast.literal_eval(js_obj)
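The five substitutions can also be folded into a single pass. This is a sketch under the same assumptions as the answer (keys are already quoted, and the literal appears directly as an object value); the names `_JS_TO_PY` and `js_literal_to_py` are hypothetical.

```python
import ast
import re

# JavaScript literal -> Python literal (NaN is mapped to None, as in the answer).
_JS_TO_PY = {
    "undefined": "None",
    "null": "None",
    "NaN": "None",
    "true": "True",
    "false": "False",
}

# A quoted key's closing quote and colon, then a JS literal, then ',' or '}'.
_PATTERN = re.compile(r"(['\"]\s*:\s*)(undefined|null|NaN|true|false)(\s*[,}])")


def js_literal_to_py(text):
    """Replace JS-only literals used as object values, then evaluate safely."""
    converted = _PATTERN.sub(
        lambda m: m.group(1) + _JS_TO_PY[m.group(2)] + m.group(3), text
    )
    return ast.literal_eval(converted)


print(js_literal_to_py("{'a': true, 'b': null, 'c': undefined, 'd': 1}"))
```

Like the original substitutions, this does not touch literals inside arrays (for example [true, false]), so ast.literal_eval will still reject those.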
Some answers here are outdated, so here's a speed comparison between json5, hjson, and chompjs (ast.literal_eval and json.loads failed). I evaluated the functions on a 1 MB JS object to get a good sample. All three successful parsers produced an identical dictionary.
# 100.00% - reference time
chompjs.parse_js_object(text)
# 666.65% - 7 times slower
hjson.loads(text)
# 60460.57% - 605 times slower
json5.loads(text)
# fail
ast.literal_eval(text)
json.loads(text)
# won't install on Python 3.11.9
demjson
jsonnet
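The answer does not show how the timings were taken. One way such a comparison could be reproduced is sketched below using timeit from the standard library; `compare_parsers` is a hypothetical helper, and the parsers passed to it are whatever callables you have installed.

```python
import timeit


def compare_parsers(parsers, text, repeat=3, number=5):
    """Return each parser's best time as a percentage of the fastest one.

    Parsers that raise on the input are reported as None.
    """
    timings = {}
    for name, parse in parsers.items():
        try:
            parse(text)  # probe once so failures are not timed
        except Exception:
            timings[name] = None
            continue
        timings[name] = min(
            timeit.repeat(lambda: parse(text), number=number, repeat=repeat)
        )

    succeeded = [t for t in timings.values() if t is not None]
    if not succeeded:
        return timings
    best = min(succeeded)
    if best == 0:
        raise ValueError("input too small to time reliably")
    return {
        name: None if t is None else 100.0 * (t / best)
        for name, t in timings.items()
    }
```

For example, `compare_parsers({"json.loads": json.loads, "chompjs": chompjs.parse_js_object}, text)` would report the fastest successful parser as 100.0 and the others relative to it.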
Plain old javascript data structure to python object.
js_obj didn't have quotes around the keys. How complicated would your data structures get? It's hard to suggest anything without knowing the cases you're trying to solve for.
js_obj may be nested