I've been experimenting with various ways to get data from different kinds of websites, for example via JSON APIs or with BeautifulSoup. Currently, I have written a scraper to collect data such as [{Title, Description, Replies, Topic_Starter, Total_Views}], but it has almost no reusable code. I've been trying to improve my approach by appending the data to a single list for simplicity and reusability, but I've hit a wall with my current ability.
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep
url = 'https://forum.lowyat.net/ReviewsandGuides'
list_topic = []
list_description = []
list_replies = []
list_topicStarted = []
list_totalViews = []
def getContentFromURL(_url):
    try:
        response = get(_url)
        html_soup = BeautifulSoup(response.text, 'lxml')
        return html_soup
    except Exception as e:
        print('Error.getContentFromURL:', e)
        return None
def iterateThroughPages(_lastindexpost, _postperpage, _url):
    indices = '/+'
    index = 0
    for i in range(index, _lastindexpost):
        print('Getting data from ' + url)
        try:
            extractDataFromRow1(getContentFromURL(_url))
            extractDataFromRow2(getContentFromURL(_url))
            print('current page index is: ' + str(index))
            print(_url)
            while i <= _lastindexpost:
                for table in get(_url):
                    if table != None:
                        new_getPostPerPage = i + _postperpage
                        newlink = f'{url}{indices}{new_getPostPerPage}'
                        print(newlink)
                        bs_link = getContentFromURL(newlink)
                        extractDataFromRow1(bs_link)
                        extractDataFromRow2(bs_link)
                        # threading to prevent spam. Waits 0.5 secs before executing
                        sleep(0.5)
                        i += _postperpage
                        print('current page index is: ' + str(i))
                        if i > _lastindexpost:
                            # If i gets more than the input page (e.g. 1770), halt
                            print('No more available post to retrieve')
                            return
        except Exception as e:
            print('Error.iterateThroughPages:', e)
            return None
def extractDataFromRow1(_url):
    try:
        for container in _url.find_all('td', {'class': 'row1', 'valign': 'middle'}):
            # get data from topic title in table cell
            topic = container.select_one(
                'a[href^="/topic/"]').text.replace("\n", "")
            description = container.select_one(
                'div.desc').text.replace("\n", "")
            if topic or description is not None:
                dict_topic = topic
                dict_description = description
                if dict_description is '':
                    dict_description = 'No Data'
                    # list_description.append(dict_description)
                    # so no empty string
                list_topic.append(dict_topic)
                list_description.append(dict_description)
            else:
                None
    except Exception as e:
        print('Error.extractDataFromRow1:', e)
        return None
def extractDataFromRow2(_url):
    try:
        for container in _url.select('table[cellspacing="1"] > tr')[2:32]:
            replies = container.select_one('td:nth-of-type(4)').text.strip()
            topic_started = container.select_one(
                'td:nth-of-type(5)').text.strip()
            total_views = container.select_one(
                'td:nth-of-type(6)').text.strip()
            if replies or topic_started or total_views is not None:
                dict_replies = replies
                dict_topicStarted = topic_started
                dict_totalViews = total_views
                if dict_replies is '':
                    dict_replies = 'No Data'
                elif dict_topicStarted is '':
                    dict_topicStarted = 'No Data'
                elif dict_totalViews is '':
                    dict_totalViews = 'No Data'
                list_replies.append(dict_replies)
                list_topicStarted.append(dict_topicStarted)
                list_totalViews.append(dict_totalViews)
            else:
                print('no data')
                None
    except Exception as e:
        print('Error.extractDataFromRow2:', e)
        return None
# limit to 1740
print(iterateThroughPages(1740, 30, url))
new_panda = pd.DataFrame(
    {'Title': list_topic, 'Description': list_description,
     'Replies': list_replies, 'Topic Starter': list_topicStarted,
     'Total Views': list_totalViews})
print(new_panda)
I'm fairly sure my use of try is redundant at this point, as is my large collection of lists, and my use of while and for loops is most likely wrong.
1 Answer
I would separate the two concerns of getting the table data and processing it a bit more. For this it might make sense to have one generator that just yields rows from the table and gets the next page if needed:
import requests
from bs4 import BeautifulSoup, SoupStrainer

SESSION = requests.Session()

def get_table_rows(base_url, posts_per_page=30):
    """Continuously yield rows from the posts table.

    Requests a new page only when needed.
    """
    start_at = 0
    while True:
        print(f'current page index is: {start_at // posts_per_page + 1}')
        response = SESSION.get(base_url + f"/+{start_at}")
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'lxml',
                             parse_only=SoupStrainer("table", {"cellspacing": "1"}))
        yield from soup.find_all("tr")
        start_at += posts_per_page
This already chooses only the correct table, but it still contains the header row. It also reuses the connection to the server by using a requests.Session. Note that this is an infinite generator; choosing to get only the first n entries is done later using itertools.islice.
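To make that last point concrete, here is a small toy example (not part of the scraper) showing how islice caps an otherwise infinite generator by stopping once it has taken the requested number of items:

from itertools import islice

def count_up():
    """An infinite generator, analogous to get_table_rows above."""
    n = 0
    while True:
        yield n
        n += 1

print(list(islice(count_up(), 5)))  # [0, 1, 2, 3, 4]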
Now we just need to parse a single table row, which can go to another function:
def parse_row(row):
    """Get info from a row"""
    columns = row.select("td")
    try:
        if not columns or columns[0]["class"] in (["darkrow1"], ["nopad"]):
            return
    except KeyError:  # first column has no class
        # print(row)
        return
    try:
        title = row.select_one('td.row1 a[href^="/topic/"]').text.strip() or "No Data"
        description = row.select_one("td.row1 div.desc").text.strip() or "No Data"
        replies = row.select_one("td:nth-of-type(4)").text.strip() or "No Data"
        topic_starter = row.select_one('td:nth-of-type(5)').text.strip() or "No Data"
        total_views = row.select_one('td:nth-of-type(6)').text.strip() or "No Data"
    except AttributeError:  # something is None
        # print(row)
        return
    return {"Title": title,
            "Description": description,
            "Replies": replies,
            "Topic Starter": topic_starter,
            "Total Views": total_views}
def parse_rows(url):
    """Filter out rows that could not be parsed"""
    yield from filter(None, (parse_row(row) for row in get_table_rows(url)))
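As a purely illustrative aside, filter with None as the predicate keeps only truthy items, which is what silently drops the None that parse_row returns for header and separator rows:

# filter(None, iterable) keeps only truthy items
print(list(filter(None, [1, None, {"Title": "example"}, None])))
# -> [1, {'Title': 'example'}]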
Then your main loop just becomes this:
from itertools import islice

import pandas as pd

if __name__ == "__main__":
    url = 'https://forum.lowyat.net/ReviewsandGuides'
    max_posts = 1740
    df = pd.DataFrame.from_records(islice(parse_rows(url), max_posts))
    print(df)
Note that I (mostly) followed Python's official style guide, PEP 8, especially when naming variables (lower_case). This code also has an if __name__ == "__main__": guard so that another script can import from it, and the functions have (probably too short) docstrings describing what each one does.
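For instance, assuming the code above is saved as scraper.py (a hypothetical file name), another script could reuse the parsing functions without kicking off the full scrape:

# another_script.py -- hypothetical example of importing the functions above
from itertools import islice

from scraper import parse_rows  # nothing under the __main__ guard runs on import

# grab just the first 10 parsed rows for a quick check
for record in islice(parse_rows('https://forum.lowyat.net/ReviewsandGuides'), 10):
    print(record["Title"])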
- After hitting index 53, the response becomes a <Response [403]> and it goes into an infinite loop, while all responses before index 53 are 200. Am I understanding the concept wrongly? Does it index through each row one by one, or is it indexing each page? Sorry to ask for more when you've already taken the time to help and explain so much; I'm trying to understand it using the references you've given, but I keep getting that output. – Minial, Jan 11, 2019 at 2:31
- Oh, I realized what happened: I got IP-blocked from the site for requesting too often. – Minial, Jan 11, 2019 at 2:53
- After adding throttling to prevent spamming, it worked beautifully. Thank you so much. – Minial, Jan 11, 2019 at 3:34
- @Minial Yeah, that is probably the real solution to that problem, but not being stuck in an infinite loop is nice in addition :-). – Graipher, Jan 11, 2019 at 6:49
- Yup, once it reaches the last post it stops. Regardless, thanks for your time and effort :) much appreciated in helping me improve my coding; I've applied the same idea to my other methods and removed lots of unnecessary and repetitive for loops after understanding how yours works. – Minial, Jan 11, 2019 at 6:50
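As discussed in the comments above, the 403 responses came from requesting pages too quickly. A minimal sketch of throttling, assuming a half-second pause (as in the original code) is acceptable, is to sleep after each page request in get_table_rows:

import requests
from time import sleep
from bs4 import BeautifulSoup, SoupStrainer

SESSION = requests.Session()

def get_table_rows(base_url, posts_per_page=30):
    """Yield rows from the posts table, pausing between page requests."""
    start_at = 0
    while True:
        response = SESSION.get(base_url + f"/+{start_at}")
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'lxml',
                             parse_only=SoupStrainer("table", {"cellspacing": "1"}))
        yield from soup.find_all("tr")
        start_at += posts_per_page
        sleep(0.5)  # throttle so the server is not hit too often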