I've written a simple Python web scraper that parses text from an HTML table and stores the scraped data in a list of dictionaries. The code works and doesn't seem to have any glaring performance issues, but I only used the bare-bones modules lxml and requests.
Is there a more efficient or elegant way to condense the script or to improve the runtime?
The code is below:
import requests
from lxml.html import fromstring
import pprint
import re

url = "https://jobs.mo.gov/content/missouri-warn-notices-py-2017"
response = requests.get(url)
root = fromstring(response.content)

table = root.xpath('.//*[@summary="Missouri WARN Notices PY 2016"]')[0]
tableRes = []
columnHeaders = table.xpath(".//tr//th/span/text()")
for row in table.xpath(".//tr")[1:]:
    i = 0
    rowDict = {}
    for col in row.xpath(".//td"):
        if i != 1:
            rowDict[columnHeaders[i]] = re.sub(r"[\n\t]*", "", "".join(col.xpath(".//text()")).replace(u'\xa0', u' '))
        else:
            rowDict[columnHeaders[i]] = re.sub(r"[\n\t]*", "", "".join(col.xpath(".//a/text()")).replace(u'\xa0', u' '))
        i += 1
    tableRes.append(rowDict)

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(tableRes)
1 Answer
I looked at your script and found a few things that can be improved in terms of code quality as well as performance. On my local machine I was able to cut the runtime in half with the changes below.
- The default separator of str.split() is whitespace (space, tab, newline, return, formfeed). Combining it with ' '.join() returns a properly spaced string. The unicode characters ('\xa0') in your example are also handled by this approach. In general, a good way to deal with unicode is the normalize function from the unicodedata package (a short sketch follows this list).
- Instead of defining and manipulating an index for your loop, you can simply use enumerate in your for-loops.
- I prefer to store the values per row in a simple list and create the dict after the loop is done via the zip function. The performance improvement is minor, but I think it improves the readability of the code quite a bit.
- In general you should have a look at the PEP-8 guide - your variable names are not PEP-8 conformant.
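A minimal sketch of that cleanup step, assuming the cell text has already been extracted as a string (the sample value below is made up for illustration):

import unicodedata

raw = "Kansas\xa0City,\n\t Missouri "           # made-up sample cell text
normalized = unicodedata.normalize("NFKC", raw)  # NFKC folds \xa0 into a plain space
clean = " ".join(normalized.split())             # split()/join() collapses all whitespace runs
print(clean)                                     # -> Kansas City, Missouri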
Improved version:
table_res = []
column_headers = table.xpath(".//tr//th/span/text()")
for row in table.xpath(".//tr")[1:]:
    cells = []
    for i, cell in enumerate(row.xpath(".//td")):
        if i != 1:
            cells.append(' '.join(cell.text_content().split()))
        else:
            # the second column contains a link, so keep only the anchor text
            cells.append(' '.join(''.join(cell.xpath('.//a/text()')).split()))
    table_res.append({k: v for k, v in zip(column_headers, cells)})
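As a side note, the dict comprehension in the last line could also be written as dict(zip(column_headers, cells)) - it builds the same dictionary.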
When you compare the outputs of both versions you will find some small differences. This is mainly because your strings sometimes carry trailing whitespace or the date is missing some whitespace.
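If you want to reproduce the timing comparison, a sketch along these lines should work. It assumes each version's parsing logic has been wrapped in a function; parse_original and parse_improved are placeholder names, and the page is fetched once so the network round-trip is excluded from the measurement:

import timeit

import requests

url = "https://jobs.mo.gov/content/missouri-warn-notices-py-2017"
content = requests.get(url).content  # fetch once, time only the parsing

# parse_original and parse_improved are hypothetical wrappers around
# the two versions of the table-parsing code shown above
print(timeit.timeit(lambda: parse_original(content), number=100))
print(timeit.timeit(lambda: parse_improved(content), number=100))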