I am scraping a webpage that stores a lot of relevant information in a JavaScript variable:
import requests
response = requests.get('')
r = response.text
Inside r, there is a JavaScript variable that holds the data I want.
This is what the server returns:
<!DOCTYPE html>
<html>
<head>
....
<script>
var candidate_details_input_string = '{ ...}'
</script>
....
</head>
</html>
candidate_details_input_string contains a lot of data, and I use .split() to isolate the list I want:
x = r.split('candidate_completed_list\\":')[1].split(']')[0]+']'
However, this gives me the raw JavaScript-escaped string rather than a Python object. It looks something like this:
x = '[{\\"i_form_name\\":\\"Applicant_Information_Form\\",\\"completed_time\\":\\"2017-02-03T19:12:00.000Z\\"},{\\"i_form_name\\":\\"Voluntary_Self_Identification_of_Disability_template\\",\\"completed_time\\":\\"2017-02-03T19:14:00.000Z\\"},{\\"i_form_name\\":\\"Voluntary_Self_Identification_of_Disability_template\\",\\"completed_time\\":\\"2017-02-05T19:21:00.000Z\\"},{\\"i_form_name\\":\\"Government_Entity_Questions_Form\\",\\"completed_time\\":\\"2018-07-03T00:29:00.000Z\\"}]'
In JavaScript I would normally pass this string to JSON.parse(), but I can't since I'm scraping it in Python.
Is there any way to turn this into a Python object I can work with? My fallback is to do it by hand: strip all of the \\ and switch the ' into ".
3 Answers
You can parse your x variable with json.loads, which here returns a list of dictionaries. You only need to strip the backslashes first:
import json
x = '[{\\"i_form_name\\":\\"Applicant_Information_Form\\",\\"completed_time\\":\\"2017-02-03T19:12:00.000Z\\"},{\\"i_form_name\\":\\"Voluntary_Self_Identification_of_Disability_template\\",\\"completed_time\\":\\"2017-02-03T19:14:00.000Z\\"},{\\"i_form_name\\":\\"Voluntary_Self_Identification_of_Disability_template\\",\\"completed_time\\":\\"2017-02-05T19:21:00.000Z\\"},{\\"i_form_name\\":\\"Government_Entity_Questions_Form\\",\\"completed_time\\":\\"2018-07-03T00:29:00.000Z\\"}]'
data = json.loads(x.replace('\\',''))
print(data)
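Stripping every backslash works for this sample, but it would also remove legitimate escapes inside a value, such as an escaped quote. A minimal sketch of an alternative, assuming the fragment uses only JSON-compatible escapes (\", \\, \n and so on): parse it once as a JSON string to undo the escaping, then parse the result as JSON. The helper name decode_escaped_json and the shortened sample are illustrative, not part of the original answer.

```python
import json

# Shortened sample of the scraped fragment; the backslashes are literal characters.
x = r'[{\"i_form_name\":\"Applicant_Information_Form\",\"completed_time\":\"2017-02-03T19:12:00.000Z\"}]'


def decode_escaped_json(fragment):
    """Decode JSON text whose quotes are backslash-escaped (hypothetical helper)."""
    # Pass 1: wrap the fragment in double quotes and parse it as a JSON string,
    # which turns \" back into " and \\ back into \.
    unescaped = json.loads('"' + fragment + '"')
    # Pass 2: parse the now-plain JSON text into Python objects.
    return json.loads(unescaped)


forms = decode_escaped_json(x)
print(forms[0]["i_form_name"])
```

This fails with json.JSONDecodeError if the fragment contains an unescaped double quote or a JavaScript-only escape such as \', so it is a sketch for input shaped like the question's, not a general JavaScript string decoder.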
You can use ast.literal_eval in this case:
data = '''<!DOCTYPE html>
<html>
<head>
....
<script>
var candidate_details_input_string = '{"i_form_name":"Applicant_Information_Form"}';
</script>
....
</head>
</html>'''
import re
from ast import literal_eval
s = re.findall(r'var candidate_details_input_string\s*=\s*\'(.*?\})\s*\'\s*;', data, flags=re.DOTALL)[0]
data = literal_eval(s)
print(data)
Prints:
{'i_form_name': 'Applicant_Information_Form'}
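literal_eval only succeeds while the captured text happens to be a valid Python literal. When the page embeds backslash-escaped quotes, as in the question, a regex combined with two json.loads passes may be more dependable. The sketch below assumes a single-quoted JavaScript string whose double quotes are all backslash-escaped; the sample page content and the helper name extract_js_json are hypothetical.

```python
import json
import re

# Hypothetical page source; the real page's content is elided in the question.
html = r"""<script>
var candidate_details_input_string = '{\"candidate_completed_list\":[{\"i_form_name\":\"Applicant_Information_Form\"}]}';
</script>"""


def extract_js_json(page, var_name):
    """Return the parsed JSON held in a single-quoted JS string variable."""
    # Capture everything between the quotes, allowing backslash-escaped characters.
    pattern = r"var\s+" + re.escape(var_name) + r"\s*=\s*'((?:\\.|[^'\\])*)'"
    match = re.search(pattern, page, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"variable {var_name!r} not found")
    literal = match.group(1)
    # \' is legal in JavaScript but not in JSON, so unescape it first.
    literal = literal.replace("\\'", "'")
    # Undo the remaining backslash escapes, then parse the JSON itself.
    return json.loads(json.loads('"' + literal + '"'))


details = extract_js_json(html, "candidate_details_input_string")
print(details["candidate_completed_list"])
```

If the page stores the JSON with plain, unescaped double quotes instead, a single json.loads on the captured text is enough and the double parse above would raise an error.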
Comment: literal_eval(s) raises SyntaxError: unexpected character after line continuation character
You're getting JSON back from requests. Try using Python's built-in json library; you shouldn't have to do any manual parsing yourself.
import json
import requests
response = requests.get('')
r = json.loads(response.text)
Comment: json.loads() raises json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0), since the response is an HTML page rather than JSON.