
Outline:

This code uses the Split function to extract specific information from the following website: https://www.webscraper.io/test-sites/tables.

The required information is the four tables visible on the page, with headers "#", "First Name", "Last Name", "Username". I am extracting the contents of these into four dataframes.


Example table:

[Screenshot of one of the tables, with columns "#", "First Name", "Last Name", "Username"]


Description:

I use the requests library to make the GET request, and split the response text on "table table-bordered" to generate my individual table chunks.
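A minimal illustration of that chunking step on a toy snippet (not the real page markup, just a sketch of what splitting on the shared class name does):

```python
# Toy HTML standing in for the real page. Splitting on the class string
# yields one leading chunk (everything before the first table) plus one
# chunk per table.
html = ('<p>intro</p>'
        '<table class="table table-bordered"><tr><td>1</td></tr></table>'
        '<table class="table table-bordered"><tr><td>2</td></tr></table>')
chunks = html.split('table table-bordered')
print(len(chunks))   # one chunk more than the number of tables
```

Each element of `chunks[1:]` starts just inside a table's opening tag, which is why the real code below skips `chunks[0]`.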

There is a fair amount of annoying, fiddly indexing to get just the information I want. The tutorial I am following requires the use of the split function rather than something far more logical, to my mind, like Beautiful Soup, where I could simply apply CSS selectors and grab what I want. The latter method would be less fragile as well.

I have written a function, GetTable, to parse the required information from each chunk and return a dataframe. Note that the split delimiter differs between table 1 and tables 2-4.

There isn't an awful lot of code but I would appreciate any pointers on improving the code I have written.

I am running this from Spyder 3.2.8 with Python 3.6.


Code:

def GetTable(tableChunk):
    split1 = tableChunk.split('tbody')[1]
    split2 = split1.split('<table')[0]
    values = []
    aList = split2.split('>\n\t\t\t\t<')
    if len(aList) != 1:
        for item in aList[1:]:
            values.append(item.split('</')[0].split('d>'[1])[1])
    else:
        aList = split2.split('</td')
        for item in aList[:-1]:
            values.append(item.split('td>')[1])
    headers = ["#", "First Name", "Last Name", "User Name"]
    numberOfColumns = len(headers)
    numberOfRows = int(len(values) / numberOfColumns)
    df = pd.DataFrame(np.array(values).reshape(numberOfRows, numberOfColumns),
                      columns=headers)
    return df
import requests as req
import pandas as pd
import numpy as np

url = "http://webscraper.io/test-sites/tables"
response = req.get(url)
htmlText = response.text
tableChunks = htmlText.split('table table-bordered')

for tableChunk in tableChunks[1:]:
    print(GetTable(tableChunk))
    print('\n')
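The reshape-to-DataFrame step at the end of GetTable can be seen in isolation with toy values (a sketch, not the scraped data; column names are illustrative):

```python
import numpy as np
import pandas as pd

# Nine scraped cell strings, flattened in row order, reshaped into 3 rows.
values = ['1', 'Mark', '@mdo', '2', 'Jacob', '@fat', '3', 'Larry', '@twitter']
headers = ['#', 'First Name', 'Username']
rows = len(values) // len(headers)   # floor division avoids the int(...) cast
df = pd.DataFrame(np.array(values).reshape(rows, len(headers)),
                  columns=headers)
print(df)
```

This only works because the flat list's length is an exact multiple of the column count; if a cell were missed during parsing, reshape would raise a ValueError rather than silently misalign the data.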
301_Moved_Permanently
asked Aug 8, 2018 at 6:58

2 Answers

  1. Don't parse HTML manually; use the BeautifulSoup module!
  2. Imports should be at the top of the file.
  3. Use an if __name__ == '__main__' guard.
  4. Functions and variables should be snake_case.

First, you can rewrite GetTable() far more simply using the BeautifulSoup module:

import requests
from bs4 import BeautifulSoup

url = "http://webscraper.io/test-sites/tables"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

for table in soup.select('.table'):
    new_table = [[c.text for c in row.find_all('td')]
                 for row in table.find_all('tr')]

The only problem is that it will also give back empty rows for the header lines, so we need to catch those and only yield a row when it actually contains values.

Revised Code

import requests
import pandas as pd
from bs4 import BeautifulSoup


def parse_table(table):
    for row in table.find_all('tr'):
        col = [c.text for c in row.find_all('td')]
        if not all(c is None for c in col):
            yield col


def scrape_tables():
    url = "http://webscraper.io/test-sites/tables"
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    for table in soup.select('.table'):
        parsed_table = [col for col in parse_table(table)]
        df = pd.DataFrame(parsed_table,
                          columns=["#", "First Name", "Last Name", "User Name"])
        print()
        print(df)


if __name__ == '__main__':
    scrape_tables()
answered Aug 8, 2018 at 12:23
Comments:
  • Thank you, I will have a proper look later today. The tutorial requires parsing in this manner, which I agree is ludicrous; it is not how I would do it. It is akin to trying to regex your way through it. I do appreciate seeing a BS4 example and will feed back. Commented Aug 8, 2018 at 12:27
  • A tutorial that teaches you to use split for HTML parsing is a bad one, if you don't mind me saying. No point in teaching yourself bad habits, Code Horrors :) Commented Aug 8, 2018 at 12:42
  • Agreed. I am hoping it will improve. My feedback to the organisation concerned has not been positive so far. Commented Aug 8, 2018 at 14:33
  • Very helpful, +1. I will see if any further points are raised but will get back to you. Commented Aug 8, 2018 at 15:17
  • Accepted this due to the variety of points you address with my code. Thank you for taking the time to review. Commented Nov 4, 2018 at 10:51

If the tables are properly formatted (same column layout), you can do this in one line (read the HTML and parse it into DataFrames):

import pandas as pd
result = pd.read_html("https://www.webscraper.io/test-sites/tables")

Of course, there are four tables on this page, so result becomes a list:

In [7]: for item in result:
   ...:     print("\n-------------------------------------")
   ...:     print(item)
   ...:
-------------------------------------
   #  First Name  Last Name  Username
0  1        Mark       Otto      @mdo
1  2       Jacob   Thornton      @fat
2  3       Larry   the Bird  @twitter
-------------------------------------
   #  First Name  Last Name  Username
0  4       Harry     Potter       @hp
1  5        John       Snow    @dunno
2  6         Tim       Bean  @timbean
-------------------------------------
   0           1          2         3
0  #  First Name  Last Name  Username
1  1        Mark       Otto      @mdo
2  2       Jacob   Thornton      @fat
3  3       Larry   the Bird  @twitter
-------------------------------------
     0           1          2         3
0  NaN      Person  User data       NaN
1    #  First Name  Last Name  Username
2    -           -          -         -
3    1        Mark       Otto      @mdo
4    2       Jacob   Thornton      @fat
5    3       Larry   the Bird  @twitter

Obviously as the last table has merged cells, the last result is messy.
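If you do need that last table, one way to tidy it up is to promote the real header row and drop the filler rows. This is only a sketch, assuming the row layout shown above; the `messy` frame below is a hand-built stand-in for `result[3]`:

```python
import pandas as pd

# Stand-in for result[3]: the merged-cell header comes through as
# ordinary data rows (a NaN row, the real header, then a '-' filler row).
messy = pd.DataFrame([
    [float('nan'), 'Person', 'User data', float('nan')],
    ['#', 'First Name', 'Last Name', 'Username'],
    ['-', '-', '-', '-'],
    ['1', 'Mark', 'Otto', '@mdo'],
    ['2', 'Jacob', 'Thornton', '@fat'],
    ['3', 'Larry', 'the Bird', '@twitter'],
])

tidy = messy.iloc[3:].reset_index(drop=True)   # keep only the data rows
tidy.columns = messy.iloc[1]                   # row 1 holds the real header
print(tidy)
```

The hard-coded row offsets (1 and 3) are tied to this particular page, so this remains fragile; it just moves the fragility out of the parsing step.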

answered Aug 9, 2018 at 5:20
Comments:
  • That is awesome. I expected this kind of easy grabbing option from Python, as I am used to it in other languages (it doesn't always work, but it is good to have in the toolbox), and here we go. +1 Commented Aug 9, 2018 at 5:30
