I've been experimenting with various ways to get data from different kinds of websites, for example via JSON APIs or with BeautifulSoup. Currently, I have written a scraper to collect data such as [{Title, Description, Replies, Topic_Starter, Total_Views}], but it has almost no reusable code. I've been trying to improve my approach by appending the data to a single list for simplicity and reusability, but I've hit a wall with my current ability.
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep
url = 'https://forum.lowyat.net/ReviewsandGuides'
list_topic = []
list_description = []
list_replies = []
list_topicStarted = []
list_totalViews = []
def getContentFromURL(_url):
    try:
        response = get(_url)
        html_soup = BeautifulSoup(response.text, 'lxml')
        return html_soup
    except Exception as e:
        print('Error.getContentFromURL:', e)
        return None
def iterateThroughPages(_lastindexpost, _postperpage, _url):
    indices = '/+'
    index = 0
    for i in range(index, _lastindexpost):
        print('Getting data from ' + url)
        try:
            extractDataFromRow1(getContentFromURL(_url))
            extractDataFromRow2(getContentFromURL(_url))
            print('current page index is: ' + str(index))
            print(_url)
            while i <= _lastindexpost:
                for table in get(_url):
                    if table != None:
                        new_getPostPerPage = i + _postperpage
                        newlink = f'{url}{indices}{new_getPostPerPage}'
                        print(newlink)
                        bs_link = getContentFromURL(newlink)
                        extractDataFromRow1(bs_link)
                        extractDataFromRow2(bs_link)
                        # threading to prevent spam. Waits 0.5 secs before executing
                        sleep(0.5)
                        i += _postperpage
                        print('current page index is: ' + str(i))
                        if i > _lastindexpost:
                            # If i gets more than the input page (e.g. 1770), halt
                            print('No more available post to retrieve')
                            return
        except Exception as e:
            print('Error.iterateThroughPages:', e)
            return None
def extractDataFromRow1(_url):
    try:
        for container in _url.find_all('td', {'class': 'row1', 'valign': 'middle'}):
            # get data from topic title in table cell
            topic = container.select_one(
                'a[href^="/topic/"]').text.replace("\n", "")
            description = container.select_one(
                'div.desc').text.replace("\n", "")
            if topic or description is not None:
                dict_topic = topic
                dict_description = description
                if dict_description is '':
                    dict_description = 'No Data'
                    # list_description.append(dict_description)
                    # so no empty string
                list_topic.append(dict_topic)
                list_description.append(dict_description)
            else:
                None
    except Exception as e:
        print('Error.extractDataFromRow1:', e)
        return None
def extractDataFromRow2(_url):
    try:
        for container in _url.select('table[cellspacing="1"] > tr')[2:32]:
            replies = container.select_one('td:nth-of-type(4)').text.strip()
            topic_started = container.select_one(
                'td:nth-of-type(5)').text.strip()
            total_views = container.select_one(
                'td:nth-of-type(6)').text.strip()
            if replies or topic_started or total_views is not None:
                dict_replies = replies
                dict_topicStarted = topic_started
                dict_totalViews = total_views
                if dict_replies is '':
                    dict_replies = 'No Data'
                elif dict_topicStarted is '':
                    dict_topicStarted = 'No Data'
                elif dict_totalViews is '':
                    dict_totalViews = 'No Data'
                list_replies.append(dict_replies)
                list_topicStarted.append(dict_topicStarted)
                list_totalViews.append(dict_totalViews)
            else:
                print('no data')
                None
    except Exception as e:
        print('Error.extractDataFromRow2:', e)
        return None
# limit to 1740
print(iterateThroughPages(1740, 30, url))
new_panda = pd.DataFrame(
    {'Title': list_topic, 'Description': list_description,
     'Replies': list_replies, 'Topic Starter': list_topicStarted,
     'Total Views': list_totalViews})
print(new_panda)
I'm fairly sure my use of try is redundant at this point, as is my large collection of lists, and my use of while and for loops is most likely wrong.
1 Answer
I would separate the two concerns of getting the table data and processing it a bit more. For this it might make sense to have one generator that just yields rows from the table and gets the next page if needed:
import requests
from bs4 import BeautifulSoup, SoupStrainer

SESSION = requests.Session()

def get_table_rows(base_url, posts_per_page=30):
    """Continuously yield rows from the posts table.

    Requests a new page only when needed.
    """
    start_at = 0
    while True:
        print(f'current page index is: {start_at // posts_per_page + 1}')
        response = SESSION.get(base_url + f"/+{start_at}")
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'lxml',
                             parse_only=SoupStrainer("table", {"cellspacing": "1"}))
        yield from soup.find_all("tr")
        start_at += posts_per_page
This already chooses only the correct table, but it still contains the header row. It also reuses the connection to the server by using a requests.Session. Note that this is an infinite generator; choosing to get only the first n entries is done later using itertools.islice.
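To make that last point concrete, here is a small toy example (not part of the scraper) showing how islice caps an otherwise infinite generator by stopping once it has taken the requested number of items:

from itertools import islice

def count_up():
    """An infinite generator, analogous to get_table_rows above."""
    n = 0
    while True:
        yield n
        n += 1

print(list(islice(count_up(), 5)))  # [0, 1, 2, 3, 4]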
Now we just need to parse a single table row, which can go to another function:
def parse_row(row):
    """Get info from a row"""
    columns = row.select("td")
    try:
        if not columns or columns[0]["class"] in (["darkrow1"], ["nopad"]):
            return
    except KeyError:  # first column has no class
        # print(row)
        return
    try:
        title = row.select_one('td.row1 a[href^="/topic/"]').text.strip() or "No Data"
        description = row.select_one("td.row1 div.desc").text.strip() or "No Data"
        replies = row.select_one("td:nth-of-type(4)").text.strip() or "No Data"
        topic_starter = row.select_one('td:nth-of-type(5)').text.strip() or "No Data"
        total_views = row.select_one('td:nth-of-type(6)').text.strip() or "No Data"
    except AttributeError:  # something is None
        # print(row)
        return
    return {"Title": title,
            "Description": description,
            "Replies": replies,
            "Topic Starter": topic_starter,
            "Total Views": total_views}
def parse_rows(url):
    """Filter out rows that could not be parsed"""
    yield from filter(None, (parse_row(row) for row in get_table_rows(url)))
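As a purely illustrative aside, filter with None as the predicate keeps only truthy items, which is what silently drops the None that parse_row returns for header and separator rows:

# filter(None, iterable) keeps only truthy items
print(list(filter(None, [1, None, {"Title": "example"}, None])))
# -> [1, {'Title': 'example'}]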
Then your main loop just becomes this:
from itertools import islice

import pandas as pd

if __name__ == "__main__":
    url = 'https://forum.lowyat.net/ReviewsandGuides'
    max_posts = 1740
    df = pd.DataFrame.from_records(islice(parse_rows(url), max_posts))
    print(df)
Note that I (mostly) followed Python's official style guide, PEP 8, especially when naming variables (lower_case). This code also has an if __name__ == "__main__": guard so that another script can import from it, and the functions have (probably too short) docstrings describing what each one does.
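For instance, assuming the code above is saved as scraper.py (a hypothetical file name), another script could reuse the parsing functions without kicking off the full scrape:

# another_script.py -- hypothetical example of importing the functions above
from itertools import islice

from scraper import parse_rows  # nothing under the __main__ guard runs on import

# grab just the first 10 parsed rows for a quick check
for record in islice(parse_rows('https://forum.lowyat.net/ReviewsandGuides'), 10):
    print(record["Title"])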
- After hitting index 53, the response becomes a <Response [403]> and it goes into an infinite loop, while all responses before index 53 are 200. Am I understanding the concept wrongly? Does it index through each row one by one, or is it indexing each page? Sorry to ask for more when you've already taken the time to help and explain so much; I'm trying to understand it using the references you've given, but I keep getting that output. – Minial, Jan 11, 2019 at 2:31
- Oh, I realized what happened: I got IP-blocked from the site for requesting too often. – Minial, Jan 11, 2019 at 2:53
- After adding throttling to prevent spamming, it worked beautifully. Thank you so much. – Minial, Jan 11, 2019 at 3:34
- @Minial Yeah, that is probably the real solution to that problem, but not being stuck in an infinite loop is nice in addition :-). – Graipher, Jan 11, 2019 at 6:49
- Yup, once it reaches the last post it stops. Regardless, thanks for your time and effort :) much appreciated in helping me improve my coding; I've applied the same idea to my other methods and removed lots of unnecessary and repetitive for loops after understanding how yours works. – Minial, Jan 11, 2019 at 6:50
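As discussed in the comments above, the 403 responses came from requesting pages too quickly. A minimal sketch of throttling, assuming a half-second pause (as in the original code) is acceptable, is to sleep after each page request in get_table_rows:

import requests
from time import sleep
from bs4 import BeautifulSoup, SoupStrainer

SESSION = requests.Session()

def get_table_rows(base_url, posts_per_page=30):
    """Yield rows from the posts table, pausing between page requests."""
    start_at = 0
    while True:
        response = SESSION.get(base_url + f"/+{start_at}")
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'lxml',
                             parse_only=SoupStrainer("table", {"cellspacing": "1"}))
        yield from soup.find_all("tr")
        start_at += posts_per_page
        sleep(0.5)  # throttle so the server is not hit too often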