I started practicing web scraping a few days ago. I wrote this code to extract data from a Wikipedia page. The page has several tables that classify mountains by their height, but the tables differ in width: some contain 5 columns while others contain 4. So I made this algorithm to extract all the names and the attributes of the mountains into separate lists. My approach was to create a lengths list that stores the number of <td> cells within each table's <tr> tags. The algorithm detects which tables contain only four columns and fills the missing column with None, so that all the lists stay the same length. However, I believe there is a more efficient and more Pythonic way to do it, especially in the part where I call find_next() repeatedly. Any suggestions are welcome.
import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = "https://en.wikipedia.org/wiki/List_of_mountains_by_elevation"
content = requests.get(URL).content
soup = BeautifulSoup(content, 'html.parser')
all_tables = soup.find_all("table", {"class": ["sortable", "plainrowheaders"]})

mountain_names = []
metres_KM = []
metres_FT = []
range_Mnt = []
location = []
lengths = []

for table in range(len(all_tables)):
    x = all_tables[table].find("tr").find_next("tr")
    y = x.find_all("td")
    lengths.append(len(y))
    for row in all_tables[table].find_all("tr"):
        try:
            mountain_names.append(row.find("td").text)
            metres_KM.append(row.find("td").find_next("td").text)
            metres_FT.append(row.find("td").find_next("td").find_next("td").text)
            if lengths[table] == 5:
                range_Mnt.append(row.find("td").find_next("td").find_next("td").find_next("td").text)
            else:
                range_Mnt.append(None)
            location.append(row.find("td").find_next("td").find_next("td").find_next("td").find_next("td").text)
        except:
            pass
- Is the code working as expected? – Sᴀᴍ Onᴇᴌᴀ, Jun 25, 2018 at 23:17
- Yes, totally. However I want to find out a better way to scrape tables rather than using find_next() all the time. – brain_dead_cow, Jun 25, 2018 at 23:18
- Alright; by the way, welcome to Code Review. Hopefully you receive good answers! – Sᴀᴍ Onᴇᴌᴀ, Jun 25, 2018 at 23:19
- Please do not update the code in your question to incorporate feedback from answers; doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers. – Sᴀᴍ Onᴇᴌᴀ, Jun 26, 2018 at 14:59
1 Answer
You're currently looping over the rows, but not over the cells:

for row in all_tables[table].find_all("tr"):

Rather than chaining multiple find_next("td") calls one after the other, add an inner loop using row.find_all('td') and append each row's cells to a 2D list. Manipulating a 2D list is much easier and will make your code look much cleaner than row.find("td").find_next("td").find_next("td").

Good luck!
Other questions on the site contain answers that might interest you. To be more specific, this code snippet from @shaktimaan shows the pattern:
data = []
table = soup.find('table', attrs={'class': 'lineItemsTable'})
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])
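One caveat: the snippet above drops empty cells (the `if ele` filter), which would break column alignment for your tables. For your use case you instead want to pad the 4-column rows so that every row ends up with the same five fields, with None in the missing "Range" position. A minimal sketch of that idea, assuming each row has already been extracted as a list of cell strings (the `normalize_row` helper and the sample rows here are hypothetical, not taken from the page):

```python
def normalize_row(cells, width=5, pad_index=3):
    """Pad a 4-column row to `width` columns by inserting None at
    `pad_index` (the missing 'Range' column), so all rows line up."""
    if len(cells) == width:
        return cells
    return cells[:pad_index] + [None] + cells[pad_index:]

# Rows as they might come out of row.find_all('td') with .text.strip()
rows = [
    ["Everest", "8848", "29029", "Himalayas", "Nepal/China"],  # 5 columns
    ["Some peak", "7000", "22966", "Pakistan"],                # 4 columns
]
normalized = [normalize_row(r) for r in rows]
```

After normalizing, every row has five entries and the whole 2D list can be handed straight to pandas.DataFrame with a single list of column names, instead of maintaining five parallel lists.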
- Thank you for your reply. Because I am new to scraping and to Python generally, what I understand is that you mean to replace the try part of my code with this loop? I did it, by the way, but the data list is empty. – brain_dead_cow, Jun 26, 2018 at 11:57