I'm attempting to scrape data.cdc.gov for their COVID-19 information on cases and deaths.
The problem I'm having is that the code is very inefficient: it takes an extremely long time to run. The CDC's XML export doesn't work at all for me, and the API appears incomplete. I need all of the COVID-19 data starting from January 22, 2020, up until now, but the API doesn't seem to contain records for all of those days. Could someone help me make this code more efficient so that I can extract the information I need?
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time

options = Options()
options.add_argument('--no-sandbox')
url = 'https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36/data'
driver = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\chromedriver.exe", options=options)
driver.implicitly_wait(10)
driver.get(url)

covid_fin = []  # one dict per table row, accumulated across all pages
while True:
    rows = driver.find_elements_by_xpath("//div[contains(@class, 'socrata-table frozen-columns')]")
    for table in rows:
        # read the column headers for the current page
        headers = []
        for head in table.find_elements_by_xpath('//*[@id="renderTypeContainer"]/div[4]/div[2]/div/div[4]/div[1]/div/table/thead/tr/th'):
            headers.append(head.text)
        # read each body row and pair the cell text with the headers
        for row in table.find_elements_by_xpath('//*[@id="renderTypeContainer"]/div[4]/div[2]/div/div[4]/div[1]/div/table/tbody/tr'):
            covid = []
            for col in row.find_elements_by_xpath("./*[name()='td']"):
                covid.append(col.text)
            if covid:
                covid_dict = {headers[i]: covid[i] for i in range(len(headers))}
                covid_fin.append(covid_dict)
    # advance to the next page of the table; stop when the Next button is gone
    try:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CLASS_NAME, 'pager-button-next'))).click()
        time.sleep(5)
    except Exception:
        break
- @Mast okay, so I just uninstalled version 3.9.1, downloaded 3.10.1, and restarted my computer. Lastly I just reopened Jupyter and reran my code. And it's working the same way. Do you have any other tips? Thanks again for reminding me to update in any event. – Nini, Dec 9, 2021 at 17:00
- You're running this in a Jupyter notebook? All in the same code-block? – Mast, Dec 9, 2021 at 17:01
- @Mast yes I am. – Nini, Dec 9, 2021 at 17:06
- @Mast I just separated the loop out of the first part of the code into a different cell and ran it. It's still running slowly. – Nini, Dec 9, 2021 at 17:12
- You might be interested in the process, with Python code included, described at Federal COVID Data in a Single Stream, as there is some somewhat complex nuance to working with the CDC data and matching it up to other federal datasets, like COVID testing and hospitalizations. – Zach Lipton, Dec 10, 2021 at 3:09
2 Answers
In my opinion, Selenium isn't the right tool for web scraping much (probably most) of the time. It turns out that even when websites use JavaScript, you can usually figure out what that JS is doing by using your browser's network inspector.
If you open the inspector (Ctrl+Shift+I in Chrome) and then load the initial URL, you'll see all the requests the page makes, with a preview pane to the right. One trick is to just click through the requests, looking at the preview, until you see something that looks like the data you want. The first "data" request turns out not to have any data.
If you go down a little ways, you'll find the data.
Once you find the data, go back to the Headers tab in the inspector, where you can get the URL of that request.
Let's copy and paste that into a script:
dataurl="https://data.cdc.gov/api/id/9mfq-cb36.json?$query=select%20*%2C%20%3Aid%20limit%20100"
Now, on the site, let's click Next and see what happens (I had already clicked it before taking the screenshots, so the follow-up requests are already visible there). If you grab the URLs from those requests, you'll start to see a pattern...
dataurl= "https://data.cdc.gov/api/id/9mfq-cb36.json?$query=select%20*%2C%20%3Aid%20limit%20100"
dataurl2="https://data.cdc.gov/api/id/9mfq-cb36.json?$query=select%20*%2C%20%3Aid%20offset%20100%20limit%20100"
dataurl3="https://data.cdc.gov/api/id/9mfq-cb36.json?$query=select%20*%2C%20%3Aid%20offset%20200%20limit%20100"
In the first one, there is a select with some URL-encoded gibberish followed by a limit of 100. In the next ones, that select gibberish and the limit of 100 stay the same, but now there's an offset. Now we can just do...
import pandas as pd
import requests

df = []
i = 0
while True:
    # first page has no offset; after that, step the offset by 100 each time
    if i == 0:
        offset = ""
    else:
        offset = f"%20offset%20{i}00"
    url = f"https://data.cdc.gov/api/id/9mfq-cb36.json?$query=select%20*%2C%20%3Aid{offset}%20limit%20100"
    temp = pd.read_json(requests.get(url).text)
    if temp.shape[0] > 0:
        df.append(temp)  # reuse the frame we already fetched instead of requesting it again
        i += 1
    else:
        break
df = pd.concat(df)
On my computer, this ran in about 4min.
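A follow-up thought on speed: most of that time is round trips, because the captured requests only pull 100 rows each. If the endpoint honors a larger limit in the same $query (an assumption on my part; the 10,000 below is not something the captured traffic used), the same loop needs far fewer requests. A sketch under that assumption:

import pandas as pd
import requests

page_size = 10_000  # assumed to be accepted by the endpoint; lower it if the server rejects or truncates
frames = []
offset = 0
while True:
    offset_part = "" if offset == 0 else f"%20offset%20{offset}"
    url = (f"https://data.cdc.gov/api/id/9mfq-cb36.json?"
           f"$query=select%20*%2C%20%3Aid{offset_part}%20limit%20{page_size}")
    temp = pd.DataFrame(requests.get(url).json())  # expects a JSON list of row objects
    if temp.shape[0] == 0:
        break
    frames.append(temp)
    offset += page_size
df = pd.concat(frames)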
- This worked. And I'm not sure how I'm supposed to feel about this. I've been working so hard on this. Your code is so much shorter. Thank you. – Nini, Dec 9, 2021 at 20:20
- Interesting. Do you think the site could be vulnerable to SQL injection? o:) – Kate, Dec 9, 2021 at 20:42
- @Dean MacGregor, if you don't mind, could you please share a link to some documentation containing information about your process? – Nini, Dec 9, 2021 at 22:01
- "Guess, check, google, repeat." I have nothing more substantive than that. – Dean MacGregor, Dec 9, 2021 at 22:05
- Okay. Thanks for your help. – Nini, Dec 10, 2021 at 1:24
Don't scrape. Delete all of your code. Go to that page and download one of the export types. XML is richer and has more fields, but CSV is more compact.
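If you want to pull the export programmatically rather than clicking the button, something like the following should work. A sketch only: the export URL is assumed from the usual Socrata pattern for this dataset, so copy the actual link behind the page's Export button if it differs.

import pandas as pd

# Assumed Socrata CSV export URL for the 9mfq-cb36 dataset; verify against the
# link behind Export -> CSV on the page before relying on it.
csv_url = "https://data.cdc.gov/api/views/9mfq-cb36/rows.csv?accessType=DOWNLOAD"
df = pd.read_csv(csv_url)
print(df.shape)             # all rows in one download, no paging needed
print(df.columns.tolist())  # column names for a quick sanity check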
- Ha. Until I saw your answer I assumed the export button didn't work for some reason. – Dean MacGregor, Dec 9, 2021 at 19:36
- @Reinderien I'm scraping to show my capability to do so for my portfolio. And the XML is corrupted. – Nini, Dec 9, 2021 at 19:40
- In my opinion, scraping something that is a poor use case for scraping is not a great portfolio entry. There are plenty of other sites that require scraping that would be a better fit. – Reinderien, Dec 9, 2021 at 19:47