I'm attempting to scrape data.cdc.gov for their COVID-19 information on cases and deaths.
The problem I'm having is that the code is very inefficient: it takes an extremely long time to run. The CDC's XML export doesn't work at all for me, and the API appears incomplete. I need all of the COVID-19 data starting from January 22, 2020, up until now, but the API doesn't seem to contain records for all of those days. Could someone help me make this code more efficient so that I can extract the information I need?
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time

options = Options()
options.add_argument('--no-sandbox')
url = 'https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36/data'
driver = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\chromedriver.exe", options=options)
driver.implicitly_wait(10)
driver.get(url)

covid_fin = []  # one dict per table row, accumulated across all pages
while True:
    rows = driver.find_elements_by_xpath("//div[contains(@class, 'socrata-table frozen-columns')]")
    for table in rows:
        # read the column headers for the current page
        headers = []
        for head in table.find_elements_by_xpath('//*[@id="renderTypeContainer"]/div[4]/div[2]/div/div[4]/div[1]/div/table/thead/tr/th'):
            headers.append(head.text)
        # read each body row and pair the cell text with the headers
        for row in table.find_elements_by_xpath('//*[@id="renderTypeContainer"]/div[4]/div[2]/div/div[4]/div[1]/div/table/tbody/tr'):
            covid = []
            for col in row.find_elements_by_xpath("./*[name()='td']"):
                covid.append(col.text)
            if covid:
                covid_dict = {headers[i]: covid[i] for i in range(len(headers))}
                covid_fin.append(covid_dict)
    # advance to the next page of the table; stop when the Next button is gone
    try:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CLASS_NAME, 'pager-button-next'))).click()
        time.sleep(5)
    except Exception:
        break
- @Mast okay, so I just uninstalled version 3.9.1, downloaded 3.10.1, and restarted my computer. Lastly I just reopened Jupyter and reran my code. And it's working the same way. Do you have any other tips? Thanks again for reminding me to update in any event. – Nini, Dec 9, 2021 at 17:00
- You're running this in a Jupyter notebook? All in the same code-block? – Mast, Dec 9, 2021 at 17:01
- @Mast yes I am. – Nini, Dec 9, 2021 at 17:06
- @Mast I just separated the loop out of the first part of the code into a different cell and ran it. It's still running slowly. – Nini, Dec 9, 2021 at 17:12
- You might be interested in the process, with Python code included, described at Federal COVID Data in a Single Stream, as there is some somewhat complex nuance to working with the CDC data and matching it up to other federal datasets, like COVID testing and hospitalizations. – Zach Lipton, Dec 10, 2021 at 3:09
2 Answers
In my opinion, Selenium isn't the right tool for web scraping much (probably most) of the time. It turns out that even when websites use JavaScript, you can usually figure out what that JS is doing by using your browser's network inspector.
If you open the inspector (Ctrl+Shift+I in Chrome) and then load the initial URL, you'll see all the requests the page makes, with a preview pane to the right. One trick is to just click through the requests, looking at the preview, until you see something that looks like the data you want. The first "data" request turns out not to have any data.
If you go down a little ways, you'll find the data.
Once you find the data, go back to the Headers tab in the inspector, where you can get the URL of that request.
Let's copy and paste that into a script:
dataurl="https://data.cdc.gov/api/id/9mfq-cb36.json?$query=select%20*%2C%20%3Aid%20limit%20100"
Now, on the site, let's click Next and see what happens (I had already clicked it before taking the screenshots, so the follow-up requests are already visible there). If you grab the URLs from those requests, you'll start to see a pattern...
dataurl= "https://data.cdc.gov/api/id/9mfq-cb36.json?$query=select%20*%2C%20%3Aid%20limit%20100"
dataurl2="https://data.cdc.gov/api/id/9mfq-cb36.json?$query=select%20*%2C%20%3Aid%20offset%20100%20limit%20100"
dataurl3="https://data.cdc.gov/api/id/9mfq-cb36.json?$query=select%20*%2C%20%3Aid%20offset%20200%20limit%20100"
In the first one, there is a select with some URL-encoded gibberish followed by a limit of 100. In the next ones, that select gibberish and the limit of 100 stay the same, but now there's an offset. Now we can just do...
import pandas as pd
import requests

df = []
i = 0
while True:
    # first page has no offset; after that, step the offset by 100 each time
    if i == 0:
        offset = ""
    else:
        offset = f"%20offset%20{i}00"
    url = f"https://data.cdc.gov/api/id/9mfq-cb36.json?$query=select%20*%2C%20%3Aid{offset}%20limit%20100"
    temp = pd.read_json(requests.get(url).text)
    if temp.shape[0] > 0:
        df.append(temp)  # reuse the frame we already fetched instead of requesting it again
        i += 1
    else:
        break
df = pd.concat(df)
On my computer, this ran in about 4min.
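A follow-up thought on speed: most of that time is round trips, because the captured requests only pull 100 rows each. If the endpoint honors a larger limit in the same $query (an assumption on my part; the 10,000 below is not something the captured traffic used), the same loop needs far fewer requests. A sketch under that assumption:

import pandas as pd
import requests

page_size = 10_000  # assumed to be accepted by the endpoint; lower it if the server rejects or truncates
frames = []
offset = 0
while True:
    offset_part = "" if offset == 0 else f"%20offset%20{offset}"
    url = (f"https://data.cdc.gov/api/id/9mfq-cb36.json?"
           f"$query=select%20*%2C%20%3Aid{offset_part}%20limit%20{page_size}")
    temp = pd.DataFrame(requests.get(url).json())  # expects a JSON list of row objects
    if temp.shape[0] == 0:
        break
    frames.append(temp)
    offset += page_size
df = pd.concat(frames)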
- This worked. And I'm not sure how I'm supposed to feel about this. I've been working so hard on this. Your code is so much shorter. Thank you. – Nini, Dec 9, 2021 at 20:20
- Interesting. Do you think the site could be vulnerable to SQL injection? o:) – Kate, Dec 9, 2021 at 20:42
- @Dean MacGregor, if you don't mind, could you please share a link to some documentation containing information about your process? – Nini, Dec 9, 2021 at 22:01
- "Guess, check, google, repeat." I have nothing more substantive than that. – Dean MacGregor, Dec 9, 2021 at 22:05
- Okay. Thanks for your help. – Nini, Dec 10, 2021 at 1:24
Don't scrape. Delete all of your code. Go to that page and download one of the export types. XML is richer and has more fields, but CSV is more compact.
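If you want to pull the export programmatically rather than clicking the button, something like the following should work. A sketch only: the export URL is assumed from the usual Socrata pattern for this dataset, so copy the actual link behind the page's Export button if it differs.

import pandas as pd

# Assumed Socrata CSV export URL for the 9mfq-cb36 dataset; verify against the
# link behind Export -> CSV on the page before relying on it.
csv_url = "https://data.cdc.gov/api/views/9mfq-cb36/rows.csv?accessType=DOWNLOAD"
df = pd.read_csv(csv_url)
print(df.shape)             # all rows in one download, no paging needed
print(df.columns.tolist())  # column names for a quick sanity check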
- Ha. Until I saw your answer I assumed the export button didn't work for some reason. – Dean MacGregor, Dec 9, 2021 at 19:36
- @Reinderien I'm scraping to show my capability to do so for my portfolio. And the XML is corrupted. – Nini, Dec 9, 2021 at 19:40
- In my opinion, scraping something that is a poor use case for scraping is not a great portfolio entry. There are plenty of other sites that require scraping that would be a better fit. – Reinderien, Dec 9, 2021 at 19:47