I've been experimenting with various ways to get data from a variety of websites, such as parsing JSON or using BeautifulSoup. Currently, I have written a scraper that collects data in the form [{Title, Description, Replies, Topic_Starter, Total_Views}], but it has practically no reusable code. I've been trying to correct my approach of appending data to one singular list, for simplicity and reusability, but I've hit a wall with my current abilities.

from requests import get
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep

url = 'https://forum.lowyat.net/ReviewsandGuides'

list_topic = []
list_description = []
list_replies = []
list_topicStarted = []
list_totalViews = []


def getContentFromURL(_url):
    try:
        response = get(_url)
        html_soup = BeautifulSoup(response.text, 'lxml')
        return html_soup
    except Exception as e:
        print('Error.getContentFromURL:', e)
        return None


def iterateThroughPages(_lastindexpost, _postperpage, _url):
    indices = '/+'
    index = 0
    for i in range(index, _lastindexpost):
        print('Getting data from ' + url)
        try:
            extractDataFromRow1(getContentFromURL(_url))
            extractDataFromRow2(getContentFromURL(_url))
            print('current page index is: ' + str(index))
            print(_url)
            while i <= _lastindexpost:
                for table in get(_url):
                    if table != None:
                        new_getPostPerPage = i + _postperpage
                        newlink = f'{url}{indices}{new_getPostPerPage}'
                        print(newlink)
                        bs_link = getContentFromURL(newlink)
                        extractDataFromRow1(bs_link)
                        extractDataFromRow2(bs_link)
                        # threading to prevent spam. Waits 0.5 secs before executing
                        sleep(0.5)
                        i += _postperpage
                        print('current page index is: ' + str(i))
                        if i > _lastindexpost:
                            # If i gets more than the input page(etc 1770) halts
                            print('No more available post to retrieve')
                            return
        except Exception as e:
            print('Error.iterateThroughPages:', e)
            return None


def extractDataFromRow1(_url):
    try:
        for container in _url.find_all('td', {'class': 'row1', 'valign': 'middle'}):
            # get data from topic title in table cell
            topic = container.select_one(
                'a[href^="/topic/"]').text.replace("\n", "")
            description = container.select_one(
                'div.desc').text.replace("\n", "")
            if topic or description is not None:
                dict_topic = topic
                dict_description = description
                if dict_description is '':
                    dict_description = 'No Data'
                    # list_description.append(dict_description)
                    # so no empty string
                list_topic.append(dict_topic)
                list_description.append(dict_description)
            else:
                None
    except Exception as e:
        print('Error.extractDataFromRow1:', e)
        return None


def extractDataFromRow2(_url):
    try:
        for container in _url.select('table[cellspacing="1"] > tr')[2:32]:
            replies = container.select_one('td:nth-of-type(4)').text.strip()
            topic_started = container.select_one(
                'td:nth-of-type(5)').text.strip()
            total_views = container.select_one(
                'td:nth-of-type(6)').text.strip()
            if replies or topic_started or total_views is not None:
                dict_replies = replies
                dict_topicStarted = topic_started
                dict_totalViews = total_views
                if dict_replies is '':
                    dict_replies = 'No Data'
                elif dict_topicStarted is '':
                    dict_topicStarted = 'No Data'
                elif dict_totalViews is '':
                    dict_totalViews = 'No Data'
                list_replies.append(dict_replies)
                list_topicStarted.append(dict_topicStarted)
                list_totalViews.append(dict_totalViews)
            else:
                print('no data')
                None
    except Exception as e:
        print('Error.extractDataFromRow2:', e)
        return None


# limit to 1740
print(iterateThroughPages(1740, 30, url))

new_panda = pd.DataFrame(
    {'Title': list_topic, 'Description': list_description,
     'Replies': list_replies, 'Topic Starter': list_topicStarted,
     'Total Views': list_totalViews})
print(new_panda)

I'm sure my use of try is redundant at this point, as is my large collection of lists, and my use of while and for is most likely wrong.

asked Jan 10, 2019 at 4:54

1 Answer


I would separate the two concerns of getting the table data and processing it a bit more. For this it might make sense to have one generator that just yields rows from the table and gets the next page if needed:

import requests
from bs4 import BeautifulSoup, SoupStrainer

SESSION = requests.Session()


def get_table_rows(base_url, posts_per_page=30):
    """Continuously yield rows from the posts table.

    Requests a new page only when needed.
    """
    start_at = 0
    while True:
        print(f'current page index is: {start_at // posts_per_page + 1}')
        response = SESSION.get(base_url + f"/+{start_at}")
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'lxml',
                             parse_only=SoupStrainer("table", {"cellspacing": "1"}))
        yield from soup.find_all("tr")
        start_at += posts_per_page

This already chooses only the correct table, but still contains the header row. It also reuses the connection to the server by using a requests.Session. This is an infinite generator. Choosing to only get the first n entries is done later using itertools.islice.
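To see why an infinite generator is safe to consume, here is a standalone toy sketch (not part of the scraper) of islice capping an endless generator:

from itertools import islice

def naturals():
    """An endless generator, structurally like get_table_rows."""
    n = 0
    while True:
        yield n
        n += 1

# islice stops pulling values after five items, so the infinite
# while-loop inside the generator is never a problem.
print(list(islice(naturals(), 5)))  # [0, 1, 2, 3, 4]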

Now we just need to parse a single table row, which can go to another function:

def parse_row(row):
    """Get info from a row"""
    columns = row.select("td")
    try:
        if not columns or columns[0]["class"] in (["darkrow1"], ["nopad"]):
            return
    except KeyError:  # first column has no class
        # print(row)
        return
    try:
        title = row.select_one('td.row1 a[href^="/topic/"]').text.strip() or "No Data"
        description = row.select_one("td.row1 div.desc").text.strip() or "No Data"
        replies = row.select_one("td:nth-of-type(4)").text.strip() or "No Data"
        topic_starter = row.select_one('td:nth-of-type(5)').text.strip() or "No Data"
        total_views = row.select_one('td:nth-of-type(6)').text.strip() or "No Data"
    except AttributeError:  # something is None
        # print(row)
        return
    return {"Title": title,
            "Description": description,
            "Replies": replies,
            "Topic Starter": topic_starter,
            "Total Views": total_views}


def parse_rows(url):
    """Filter out rows that could not be parsed"""
    yield from filter(None, (parse_row(row) for row in get_table_rows(url)))

Then your main loop just becomes this:

from itertools import islice

import pandas as pd

if __name__ == "__main__":
    url = 'https://forum.lowyat.net/ReviewsandGuides'
    max_posts = 1740
    df = pd.DataFrame.from_records(islice(parse_rows(url), max_posts))
    print(df)

Note that I (mostly) followed Python's official style guide, PEP 8, especially when naming variables (lower_case). This code also has an if __name__ == "__main__": guard so the script can be imported from another script, and the functions have (probably too short) docstrings describing what each one does.
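For example, if this code were saved as scraper.py (a hypothetical filename, chosen just for illustration), the guard means another script can reuse the parser without kicking off the full scrape at import time:

# another_script.py -- assumes the answer's code is saved as scraper.py
from itertools import islice

from scraper import parse_rows  # nothing under the __main__ guard runs on import

# Pull just the first two parsed rows as a quick sanity check.
for record in islice(parse_rows('https://forum.lowyat.net/ReviewsandGuides'), 2):
    print(record)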

answered Jan 10, 2019 at 14:13
  • after hitting index 53, the response becomes a <Response [403]> and goes into an infinite loop, while everything before index 53 returned 200. Am I understanding the concept wrongly? Does it index through each row one by one, or through each page? Sorry to ask for more when you've already taken the time to help and explain so much; I'm trying to understand it using the references you've given, but I keep getting that output. Commented Jan 11, 2019 at 2:31
  • oh, I realized what happened. I got IP-blocked from the site for requesting too often. Commented Jan 11, 2019 at 2:53
  • After adding threading to prevent spam, it worked beautifully~ thank you so much. Commented Jan 11, 2019 at 3:34
  • @Minial Yeah, that is probably the real solution to that problem, but not being stuck in an infinite loop is nice in addition :-). Commented Jan 11, 2019 at 6:49
  • Yup, once it reaches the last post it stops~ Regardless, thanks for your time and effort :) Much appreciated; after understanding how yours works, I applied the same approach to my other methods and removed lots of unnecessary, repetitive for loops. Commented Jan 11, 2019 at 6:50
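What the comments call "threading" is really just a pause between page requests. A minimal sketch of that rate-limiting fix, assuming a 0.5-second delay is acceptable to the site (an untested guess, not a documented limit), would be a variant of the answer's generator:

import time

import requests
from bs4 import BeautifulSoup, SoupStrainer

SESSION = requests.Session()

def get_table_rows_politely(base_url, posts_per_page=30, delay=0.5):
    """Like get_table_rows above, but sleeps between page requests.

    The 0.5 s default is an assumption; tune it to what the site tolerates.
    """
    start_at = 0
    while True:
        response = SESSION.get(base_url + f"/+{start_at}")
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'lxml',
                             parse_only=SoupStrainer("table", {"cellspacing": "1"}))
        yield from soup.find_all("tr")
        start_at += posts_per_page
        time.sleep(delay)  # pause before fetching the next page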
