Outline:
This code uses the split function to extract specific information from the following website: https://www.webscraper.io/test-sites/tables.
The required information is the four tables visible on the page with the headers "#", "First Name", "Last Name", "Username". I am extracting the information within these into four dataframes.
Example table:
Description:
I use the requests library to make the GET request, and split the response text on "table table-bordered" to generate my individual table chunks.
There is a fair amount of annoying, fiddly indexing to get just the info I want, but the tutorial I am following requires the use of the split function rather than something far more logical, to my mind, like Beautiful Soup, where I could just apply CSS selectors, for example, and grab what I want. The latter method would be less fragile as well.
I have written a function, GetTable, to parse the required information from each chunk and return a dataframe. Note that the split delimiter differs between table 1 and tables 2-4.
There isn't an awful lot of code, but I would appreciate any pointers on improving what I have written.
I am running this from Spyder 3.2.8 with Python 3.6.
Code:
def GetTable(tableChunk):
    split1 = tableChunk.split('tbody')[1]
    split2 = split1.split('<table')[0]

    values = []
    aList = split2.split('>\n\t\t\t\t<')
    if len(aList) != 1:
        for item in aList[1:]:
            values.append(item.split('</')[0].split('d>'[1])[1])
    else:
        aList = split2.split('</td')
        for item in aList[:-1]:
            values.append(item.split('td>')[1])

    headers = ["#", "First Name", "Last Name", "User Name"]
    numberOfColumns = len(headers)
    numberOfRows = int(len(values) / numberOfColumns)
    df = pd.DataFrame(np.array(values).reshape(numberOfRows, numberOfColumns), columns=headers)
    return df
import requests as req
import pandas as pd
import numpy as np

url = "http://webscraper.io/test-sites/tables"
response = req.get(url)
htmlText = response.text

tableChunks = htmlText.split('table table-bordered')

for tableChunk in tableChunks[1:]:
    print(GetTable(tableChunk))
    print('\n')
2 Answers
- Don't parse HTML manually; you should use the BeautifulSoup module!
- import statements should be at the top of the file
- Use an if __name__ == '__main__' guard
- Functions and variables should be snake_case

First, you can rewrite GetTable() a lot more simply using the BeautifulSoup module:
import requests
from bs4 import BeautifulSoup

url = "http://webscraper.io/test-sites/tables"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
for table in soup.select('.table'):
    new_table = [[c.text for c in row.find_all('td')] for row in table.find_all('tr')]
The only problem is that header rows use th cells rather than td, so they come back as empty lists; we need to filter those out and only yield rows that actually contain data.
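To see why the header rows need filtering, a minimal check on a toy table (the HTML here is a made-up example, not the live page):

```python
from bs4 import BeautifulSoup

# A header row uses <th>, so row.find_all('td') returns an empty list
# for it; only data rows produce cell values.
html = ("<table>"
        "<tr><th>#</th><th>First Name</th></tr>"
        "<tr><td>1</td><td>Mark</td></tr>"
        "</table>")
soup = BeautifulSoup(html, "html.parser")
rows = [[c.text for c in row.find_all("td")]
        for row in soup.find_all("tr")]
print(rows)  # the header row comes back as []
```

The filter in parse_table below skips those empty lists, because all() over an empty sequence is True.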
Revised Code
import requests
import pandas as pd
from bs4 import BeautifulSoup

def parse_table(table):
    for row in table.find_all('tr'):
        col = [c.text for c in row.find_all('td')]
        if not all(c is None for c in col):
            yield col

def scrape_tables():
    url = "http://webscraper.io/test-sites/tables"
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    for table in soup.select('.table'):
        parsed_table = [col for col in parse_table(table)]
        df = pd.DataFrame(parsed_table, columns=["#", "First Name", "Last Name", "User Name"])
        print()
        print(df)

if __name__ == '__main__':
    scrape_tables()
- Thank you, I will have a proper look later today. The tutorial requires parsing in this manner, which I agree is ludicrous. It is not how I would do it; it is akin to trying to regex your way through it. I do appreciate seeing a BS4 example and will feed back. – QHarr, Aug 8, 2018 at 12:27
- A tutorial that teaches you to use split for HTML parsing is a bad one, if you don't mind me saying. No point in teaching yourself bad habits, Code Horrors :) – Ludisposed, Aug 8, 2018 at 12:42
- Agreed. I am hoping it will improve. My feedback has not been positive to the organisation concerned so far. – QHarr, Aug 8, 2018 at 14:33
- Very helpful, +1. I will see if any further points are raised but will get back to you. – QHarr, Aug 8, 2018 at 15:17
- Accepted this due to the variety of points you address with my code. Thank you for taking the time to review. – QHarr, Nov 4, 2018 at 10:51
If the table is properly formatted (same column layout), you can do this in one line (read the HTML and format it into a DataFrame):

import pandas as pd
result = pd.read_html("https://www.webscraper.io/test-sites/tables")

Of course, there are four tables on this page, so result becomes a list:
In [7]: for item in result:
...: print("\n-------------------------------------")
...: print(item)
...:
-------------------------------------
# First Name Last Name Username
0 1 Mark Otto @mdo
1 2 Jacob Thornton @fat
2 3 Larry the Bird @twitter
-------------------------------------
# First Name Last Name Username
0 4 Harry Potter @hp
1 5 John Snow @dunno
2 6 Tim Bean @timbean
-------------------------------------
0 1 2 3
0 # First Name Last Name Username
1 1 Mark Otto @mdo
2 2 Jacob Thornton @fat
3 3 Larry the Bird @twitter
-------------------------------------
0 1 2 3
0 NaN Person User data NaN
1 # First Name Last Name Username
2 - - - -
3 1 Mark Otto @mdo
4 2 Jacob Thornton @fat
5 3 Larry the Bird @twitter
Obviously, as the last table has merged cells, the last result is messy.
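For the tables that come back with numeric column labels (the ones without a proper thead), one common cleanup is to promote the first row to the header and drop it. A sketch on a small made-up table, assuming the same column layout as the page:

```python
import io
import pandas as pd

# Hypothetical miniature of a table whose header row is plain <td> cells,
# so read_html leaves the columns labelled 0..3.
html = """<table>
<tr><td>#</td><td>First Name</td><td>Last Name</td><td>Username</td></tr>
<tr><td>1</td><td>Mark</td><td>Otto</td><td>@mdo</td></tr>
</table>"""

df = pd.read_html(io.StringIO(html))[0]
df.columns = df.iloc[0]              # first row holds the real headers
df = df.iloc[1:].reset_index(drop=True)
print(df)
```

This still won't fix the fourth table's merged cells or the row of dashes; those would need to be dropped by hand after inspection.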
- That is awesome. I expected this kind of easy grabbing option from Python, as I am used to it with other languages (it doesn't always work, but it is good to have in the toolbox), and here we go. +1 – QHarr, Aug 9, 2018 at 5:30