I started practicing web scraping a few days ago. I wrote this code to extract data from a Wikipedia page. The page has several tables that classify mountains by their height, but the tables differ in width: some contain 5 columns while others contain 4. So I made this algorithm to extract all the names and the attributes of the mountains into separate lists. My approach was to create a lengths list that stores the number of <td> cells within each table's <tr> tags. The algorithm detects which tables contain only four columns and fills the missing column with None, so that all the lists stay the same length. However, I believe there is a more efficient and more Pythonic way to do it, especially in the part where I call find_next() repeatedly. Any suggestions are welcome.
import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = "https://en.wikipedia.org/wiki/List_of_mountains_by_elevation"
content = requests.get(URL).content
soup = BeautifulSoup(content, 'html.parser')
all_tables = soup.find_all("table", {"class": ["sortable", "plainrowheaders"]})

mountain_names = []
metres_KM = []
metres_FT = []
range_Mnt = []
location = []
lengths = []

for table in range(len(all_tables)):
    x = all_tables[table].find("tr").find_next("tr")
    y = x.find_all("td")
    lengths.append(len(y))
    for row in all_tables[table].find_all("tr"):
        try:
            mountain_names.append(row.find("td").text)
            metres_KM.append(row.find("td").find_next("td").text)
            metres_FT.append(row.find("td").find_next("td").find_next("td").text)
            if lengths[table] == 5:
                range_Mnt.append(row.find("td").find_next("td").find_next("td").find_next("td").text)
            else:
                range_Mnt.append(None)
            location.append(row.find("td").find_next("td").find_next("td").find_next("td").find_next("td").text)
        except:
            pass
- Is the code working as expected? – Sᴀᴍ Onᴇᴌᴀ, Jun 25, 2018 at 23:17
- Yes, totally. However I want to find out a better way to scrape tables rather than using find_next() all the time. – brain_dead_cow, Jun 25, 2018 at 23:18
- Alright; by the way, welcome to Code Review. Hopefully you receive good answers! – Sᴀᴍ Onᴇᴌᴀ, Jun 25, 2018 at 23:19
- Please do not update the code in your question to incorporate feedback from answers; doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers. – Sᴀᴍ Onᴇᴌᴀ, Jun 26, 2018 at 14:59
1 Answer
You're currently looping over the rows, but not over the cells:

for row in all_tables[table].find_all("tr"):

Rather than chaining multiple find_next("td") calls one after the other, add an inner loop using row.find_all('td') and append each row's cells to a 2D list. Manipulating a 2D list is much easier and will make your code look much cleaner than row.find("td").find_next("td").find_next("td").

Good luck!
Other questions on the site contain answers that might interest you. To be more specific, this code snippet from @shaktimaan shows the pattern:
data = []
table = soup.find('table', attrs={'class': 'lineItemsTable'})
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])
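One caveat: the snippet above drops empty cells (the `if ele` filter), which would break column alignment for your tables. For your use case you instead want to pad the 4-column rows so that every row ends up with the same five fields, with None in the missing "Range" position. A minimal sketch of that idea, assuming each row has already been extracted as a list of cell strings (the `normalize_row` helper and the sample rows here are hypothetical, not taken from the page):

```python
def normalize_row(cells, width=5, pad_index=3):
    """Pad a 4-column row to `width` columns by inserting None at
    `pad_index` (the missing 'Range' column), so all rows line up."""
    if len(cells) == width:
        return cells
    return cells[:pad_index] + [None] + cells[pad_index:]

# Rows as they might come out of row.find_all('td') with .text.strip()
rows = [
    ["Everest", "8848", "29029", "Himalayas", "Nepal/China"],  # 5 columns
    ["Some peak", "7000", "22966", "Pakistan"],                # 4 columns
]
normalized = [normalize_row(r) for r in rows]
```

After normalizing, every row has five entries and the whole 2D list can be handed straight to pandas.DataFrame with a single list of column names, instead of maintaining five parallel lists.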
- Thank you for your reply. Because I am new to scraping and to Python generally, what I understand is that you mean to replace the try part of my code with this loop? I did it, by the way, but the data list is empty. – brain_dead_cow, Jun 26, 2018 at 11:57