Outline:
This code uses the split function to extract specific information from the following website: https://www.webscraper.io/test-sites/tables.
The required information is the four tables visible on the page with the headers "#", "First Name", "Last Name", "Username". I am extracting the information within these into four dataframes.
Example table:
Description:
I use the requests library to make the GET request, and split the response text on "table table-bordered" to generate my individual table chunks.
There is a fair amount of annoying, fiddly indexing to get just the info I want, but the tutorial I am following requires the use of the split function rather than something far more logical, to my mind, like Beautiful Soup, where I could just apply CSS selectors, for example, and grab what I want. The latter method would be less fragile as well.
I have written a function, GetTable, to parse the required information from each chunk and return a dataframe. Note that the split delimiter differs between table 1 and tables 2-4.
There isn't an awful lot of code, but I would appreciate any pointers on improving what I have written.
I am running this from Spyder 3.2.8 with Python 3.6.
Code:
def GetTable(tableChunk):
    split1 = tableChunk.split('tbody')[1]
    split2 = split1.split('<table')[0]

    values = []
    aList = split2.split('>\n\t\t\t\t<')
    if len(aList) != 1:
        for item in aList[1:]:
            values.append(item.split('</')[0].split('d>'[1])[1])
    else:
        aList = split2.split('</td')
        for item in aList[:-1]:
            values.append(item.split('td>')[1])

    headers = ["#", "First Name", "Last Name", "User Name"]
    numberOfColumns = len(headers)
    numberOfRows = int(len(values) / numberOfColumns)
    df = pd.DataFrame(np.array(values).reshape(numberOfRows, numberOfColumns), columns=headers)
    return df
import requests as req
import pandas as pd
import numpy as np

url = "http://webscraper.io/test-sites/tables"
response = req.get(url)
htmlText = response.text

tableChunks = htmlText.split('table table-bordered')

for tableChunk in tableChunks[1:]:
    print(GetTable(tableChunk))
    print('\n')
2 Answers
- Don't parse HTML manually; you should use the BeautifulSoup module!
- import statements should be at the top of the file
- Use an if __name__ == '__main__' guard
- Functions and variables should be snake_case

First, you can rewrite GetTable() a lot more simply using the BeautifulSoup module:
import requests
from bs4 import BeautifulSoup

url = "http://webscraper.io/test-sites/tables"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
for table in soup.select('.table'):
    new_table = [[c.text for c in row.find_all('td')] for row in table.find_all('tr')]
The only problem is that header rows use th cells rather than td, so they come back as empty lists; we need to filter those out and only yield rows that actually contain data.
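To see why the header rows need filtering, a minimal check on a toy table (the HTML here is a made-up example, not the live page):

```python
from bs4 import BeautifulSoup

# A header row uses <th>, so row.find_all('td') returns an empty list
# for it; only data rows produce cell values.
html = ("<table>"
        "<tr><th>#</th><th>First Name</th></tr>"
        "<tr><td>1</td><td>Mark</td></tr>"
        "</table>")
soup = BeautifulSoup(html, "html.parser")
rows = [[c.text for c in row.find_all("td")]
        for row in soup.find_all("tr")]
print(rows)  # the header row comes back as []
```

The filter in parse_table below skips those empty lists, because all() over an empty sequence is True.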
Revised Code
import requests
import pandas as pd
from bs4 import BeautifulSoup

def parse_table(table):
    for row in table.find_all('tr'):
        col = [c.text for c in row.find_all('td')]
        if not all(c is None for c in col):
            yield col

def scrape_tables():
    url = "http://webscraper.io/test-sites/tables"
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    for table in soup.select('.table'):
        parsed_table = [col for col in parse_table(table)]
        df = pd.DataFrame(parsed_table, columns=["#", "First Name", "Last Name", "User Name"])
        print()
        print(df)

if __name__ == '__main__':
    scrape_tables()
- Thank you, I will have a proper look later today. The tutorial requires parsing in this manner, which I agree is ludicrous. It is not how I would do it; it is akin to trying to regex your way through it. I do appreciate seeing a BS4 example and will feed back. – QHarr, Aug 8, 2018 at 12:27
- A tutorial that teaches you to use split for HTML parsing is a bad one, if you don't mind me saying. No point in teaching yourself bad habits, Code Horrors :) – Ludisposed, Aug 8, 2018 at 12:42
- Agreed. I am hoping it will improve. My feedback has not been positive to the organisation concerned so far. – QHarr, Aug 8, 2018 at 14:33
- Very helpful, +1. I will see if any further points are raised but will get back to you. – QHarr, Aug 8, 2018 at 15:17
- Accepted this due to the variety of points you address with my code. Thank you for taking the time to review. – QHarr, Nov 4, 2018 at 10:51
If the table is properly formatted (same column layout), you can do this in one line (read the HTML and format it into a DataFrame):

import pandas as pd
result = pd.read_html("https://www.webscraper.io/test-sites/tables")

Of course, there are four tables on this page, so result becomes a list:
In [7]: for item in result:
...: print("\n-------------------------------------")
...: print(item)
...:
-------------------------------------
# First Name Last Name Username
0 1 Mark Otto @mdo
1 2 Jacob Thornton @fat
2 3 Larry the Bird @twitter
-------------------------------------
# First Name Last Name Username
0 4 Harry Potter @hp
1 5 John Snow @dunno
2 6 Tim Bean @timbean
-------------------------------------
0 1 2 3
0 # First Name Last Name Username
1 1 Mark Otto @mdo
2 2 Jacob Thornton @fat
3 3 Larry the Bird @twitter
-------------------------------------
0 1 2 3
0 NaN Person User data NaN
1 # First Name Last Name Username
2 - - - -
3 1 Mark Otto @mdo
4 2 Jacob Thornton @fat
5 3 Larry the Bird @twitter
Obviously, as the last table has merged cells, the last result is messy.
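For the tables that come back with numeric column labels (the ones without a proper thead), one common cleanup is to promote the first row to the header and drop it. A sketch on a small made-up table, assuming the same column layout as the page:

```python
import io
import pandas as pd

# Hypothetical miniature of a table whose header row is plain <td> cells,
# so read_html leaves the columns labelled 0..3.
html = """<table>
<tr><td>#</td><td>First Name</td><td>Last Name</td><td>Username</td></tr>
<tr><td>1</td><td>Mark</td><td>Otto</td><td>@mdo</td></tr>
</table>"""

df = pd.read_html(io.StringIO(html))[0]
df.columns = df.iloc[0]              # first row holds the real headers
df = df.iloc[1:].reset_index(drop=True)
print(df)
```

This still won't fix the fourth table's merged cells or the row of dashes; those would need to be dropped by hand after inspection.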
- That is awesome. I expected this kind of easy grabbing option from Python, as I am used to it with other languages (it doesn't always work, but it is good to have in the toolbox), and here we go. +1 – QHarr, Aug 9, 2018 at 5:30