I've written a simple Python web scraper that parses text from an HTML table and stores the scraped data in a list of dictionaries. The code works and doesn't seem to have any glaring performance issues, but I only used the bare-bones modules lxml and requests.
Is there a more efficient or elegant way to condense the script or to improve the runtime?
The code is below:
import requests
from lxml.html import fromstring
import pprint
import re

url = "https://jobs.mo.gov/content/missouri-warn-notices-py-2017"
response = requests.get(url)
root = fromstring(response.content)

table = root.xpath('.//*[@summary="Missouri WARN Notices PY 2016"]')[0]
tableRes = []
columnHeaders = table.xpath(".//tr//th/span/text()")
for row in table.xpath(".//tr")[1:]:
    i = 0
    rowDict = {}
    for col in row.xpath(".//td"):
        if i != 1:
            rowDict[columnHeaders[i]] = re.sub(r"[\n\t]*", "", "".join(col.xpath(".//text()")).replace(u'\xa0', u' '))
        else:
            rowDict[columnHeaders[i]] = re.sub(r"[\n\t]*", "", "".join(col.xpath(".//a/text()")).replace(u'\xa0', u' '))
        i += 1
    tableRes.append(rowDict)

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(tableRes)
1 Answer
I looked at your script and found a few things that can be improved in terms of code quality as well as performance. On my local machine I was able to cut the runtime in half with the changes below.
- The default separator of str.split() is whitespace (space, tab, newline, return, formfeed). Combining it with ' '.join() returns a properly spaced string. The unicode characters ('\xa0') in your example are also handled by this approach. In general, a good way to deal with unicode is the normalize function from the unicodedata package (a short sketch follows this list).
- Instead of defining and manipulating an index for your loop, you can simply use enumerate in your for-loops.
- I prefer to store the values per row in a simple list and create the dict after the loop is done via the zip function. The performance improvement is minor, but I think it improves the readability of the code quite a bit.
- In general you should have a look at the PEP-8 guide - your variable names are not PEP-8 conformant.
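A minimal sketch of that cleanup step, assuming the cell text has already been extracted as a string (the sample value below is made up for illustration):

import unicodedata

raw = "Kansas\xa0City,\n\t Missouri "           # made-up sample cell text
normalized = unicodedata.normalize("NFKC", raw)  # NFKC folds \xa0 into a plain space
clean = " ".join(normalized.split())             # split()/join() collapses all whitespace runs
print(clean)                                     # -> Kansas City, Missouri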
Improved version:
table_res = []
column_headers = table.xpath(".//tr//th/span/text()")
for row in table.xpath(".//tr")[1:]:
    cells = []
    for i, cell in enumerate(row.xpath(".//td")):
        if i != 1:
            cells.append(' '.join(cell.text_content().split()))
        else:
            # the second column contains a link, so keep only the anchor text
            cells.append(' '.join(''.join(cell.xpath('.//a/text()')).split()))
    table_res.append({k: v for k, v in zip(column_headers, cells)})
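As a side note, the dict comprehension in the last line could also be written as dict(zip(column_headers, cells)) - it builds the same dictionary.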
When you compare the outputs of both versions you will find some small differences. This is mainly because your strings sometimes carry trailing whitespace or the date is missing some whitespace.
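If you want to reproduce the timing comparison, a sketch along these lines should work. It assumes each version's parsing logic has been wrapped in a function; parse_original and parse_improved are placeholder names, and the page is fetched once so the network round-trip is excluded from the measurement:

import timeit

import requests

url = "https://jobs.mo.gov/content/missouri-warn-notices-py-2017"
content = requests.get(url).content  # fetch once, time only the parsing

# parse_original and parse_improved are hypothetical wrappers around
# the two versions of the table-parsing code shown above
print(timeit.timeit(lambda: parse_original(content), number=100))
print(timeit.timeit(lambda: parse_improved(content), number=100))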