Scraping a website with requests: turning JavaScript variable data into a Python object

Question 1

I am scraping a webpage that stores a lot of relevant information in a JavaScript variable:

import requests

response = requests.get('')
r = response.text

Inside r, there is a JavaScript variable that holds the data I want.

This is what is returned from the server:

<!DOCTYPE html>
<html>
<head>
....
<script>
 var candidate_details_input_string = '{ ...}'
</script>
....
</head>
</html>

candidate_details_input_string contains a lot of data, and I use .split() to isolate the list I want:

x = r.split('candidate_completed_list\\":')[1].split(']')[0]+']'

However, this gives me a JavaScript string literal, and I'm working in Python. It looks something like this:

x = '[{\\"i_form_name\\":\\"Applicant_Information_Form\\",\\"completed_time\\":\\"2017-02-03T19:12:00.000Z\\"},{\\"i_form_name\\":\\"Voluntary_Self_Identification_of_Disability_template\\",\\"completed_time\\":\\"2017-02-03T19:14:00.000Z\\"},{\\"i_form_name\\":\\"Voluntary_Self_Identification_of_Disability_template\\",\\"completed_time\\":\\"2017-02-05T19:21:00.000Z\\"},{\\"i_form_name\\":\\"Government_Entity_Questions_Form\\",\\"completed_time\\":\\"2018-07-03T00:29:00.000Z\\"}]'

This is a JavaScript string, and normally I would call JSON.parse() on it, but I can't since I'm scraping it in Python.

Is there any way to turn this into a Python object I can work with? My fallback is to do it by hand: remove all of the \\ and switch the ' into ".

Question 2

Can you share the URL? There are various ways to extract JavaScript variables from text.

Question 3

It's not a publicly accessible URL, unfortunately :(

Question 4

Updated with the <script> tag.

Question 5

It's actually {..}, sorry!

Question 6

Can you post a sample of what's inside the '{...}' brackets?

Question 7

You can load your x variable into a Python object with json.loads (here it becomes a list of dictionaries). We just need to strip out the backslashes first:

import json
x = '[{\\"i_form_name\\":\\"Applicant_Information_Form\\",\\"completed_time\\":\\"2017-02-03T19:12:00.000Z\\"},{\\"i_form_name\\":\\"Voluntary_Self_Identification_of_Disability_template\\",\\"completed_time\\":\\"2017-02-03T19:14:00.000Z\\"},{\\"i_form_name\\":\\"Voluntary_Self_Identification_of_Disability_template\\",\\"completed_time\\":\\"2017-02-05T19:21:00.000Z\\"},{\\"i_form_name\\":\\"Government_Entity_Questions_Form\\",\\"completed_time\\":\\"2018-07-03T00:29:00.000Z\\"}]'
data = json.loads(x.replace('\\',''))
print(data)
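
One caveat with stripping every backslash is that it would also remove backslashes that belong to the data itself. As a hedged alternative, here is a minimal sketch that undoes exactly one layer of escaping by treating the fragment as the body of a JSON string literal, then parses the result. It assumes the fragment has a single backslash before each quote (which is what the Python literal above evaluates to) and uses only JSON-compatible escapes; the helper name decode_escaped_json and the shortened sample are made up for illustration.

```python
import json

# Hypothetical, shortened fragment: one backslash before each double quote.
fragment = r'[{\"i_form_name\":\"Applicant_Information_Form\",\"completed_time\":\"2017-02-03T19:12:00.000Z\"}]'


def decode_escaped_json(text):
    """Undo one layer of string escaping, then parse the inner JSON."""
    # Wrapping in double quotes makes the fragment a JSON string literal,
    # so the first json.loads turns \" back into " (and \\ into \).
    unescaped = json.loads('"' + text + '"')
    # The second json.loads parses the now-plain JSON text.
    return json.loads(unescaped)


forms = decode_escaped_json(fragment)
print(forms[0]["i_form_name"])
```

If the real string contains escapes that JSON does not know (for example \' from a single-quoted JavaScript string), the first json.loads call raises an error and those escapes would need handling separately.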

Question 8

You can use ast.literal_eval in this case:

data = '''<!DOCTYPE html>
<html>
<head>
....
<script>
 var candidate_details_input_string = '{"i_form_name":"Applicant_Information_Form"}';
</script>
....
</head>
</html>'''
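import re
from ast import literal_eval

# Capture the contents of the single-quoted string assigned to the variable.
s = re.findall(r'var candidate_details_input_string\s*=\s*\'(.*?\})\s*\'\s*;', data, flags=re.DOTALL)[0]
# literal_eval parses the captured text as a Python literal (here, a dict).
data = literal_eval(s)
print(data)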

Prints:

{'i_form_name': 'Applicant_Information_Form'}
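
Since the captured text is JSON rather than Python source, json.loads may be a safer fit than literal_eval: JSON's true, false and null are not Python literals, so literal_eval rejects them. Below is a minimal sketch of the same regex idea with json.loads. The sample page, the "active" field and the payload shape are assumptions made up for illustration, and the sketch assumes the string holds plain JSON with no remaining JavaScript escapes.

```python
import json
import re

# Hypothetical page; the payload shape and the "active" field are invented.
html = """<script>
 var candidate_details_input_string = '{"candidate_completed_list":[{"i_form_name":"Applicant_Information_Form","active":true}]}';
</script>"""

# Capture everything between the single quotes of the assignment.
pattern = re.compile(
    r"var\s+candidate_details_input_string\s*=\s*'(.*?)'\s*;",
    re.DOTALL,
)

match = pattern.search(html)
if match is None:
    raise ValueError("candidate_details_input_string not found in page")

# json.loads understands true/false/null, which literal_eval does not.
details = json.loads(match.group(1))
print(details["candidate_completed_list"][0]["i_form_name"])
```

A string that still contains JavaScript escapes would need to be unescaped before this json.loads call.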

Question 9

I'm getting an error on literal_eval(s): SyntaxError: unexpected character after line continuation character

Question 10

@MorganAllen It would help if you posted what's inside the string, so the regex can be adjusted appropriately.

Question 11

Let me see if I can strip out the confidential information.

Question 12

You're getting JSON back from requests. Try using Python's built-in json library; you shouldn't have to do any manual parsing yourself.

import json
import requests
response = requests.get('')
r = json.loads(response.text)

Question 13

I'm getting a string of HTML back from the request, with some JavaScript inside it, so I get this error: json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Question 14

@Morgan Isolate the JSON string like you've already been doing (or use an HTML parser to get to the value), then pass it to json.loads().
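
The HTML-parser route mentioned above could look roughly like this. It is a sketch using only the standard library's html.parser; the ScriptCollector class and the sample page are made up for illustration, and it assumes the variable holds plain JSON inside single quotes.

```python
import json
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it.
        if self._in_script:
            self.scripts[-1] += data


# Hypothetical page, modelled on the snippet in the question.
page = """<!DOCTYPE html>
<html><head>
<script>
 var candidate_details_input_string = '{"i_form_name":"Applicant_Information_Form"}';
</script>
</head></html>"""

collector = ScriptCollector()
collector.feed(page)
collector.close()

pattern = re.compile(
    r"var\s+candidate_details_input_string\s*=\s*'(.*?)'\s*;",
    re.DOTALL,
)

details = None
for script in collector.scripts:
    match = pattern.search(script)
    if match:
        # Only the isolated JSON text goes to json.loads, not the whole page.
        details = json.loads(match.group(1))
        break

print(details)
```

Restricting the regex to script contents avoids accidental matches elsewhere in the page, at the cost of a little more code than a plain split.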

Accepted Answer (the json.loads answer above) · Prayson W. Daniel · 2019-07-19 18:24:21Z

