
I have created a simple web scraper that fetches news article previews from nature.com and saves each article to a file containing the article preview text.

I am learning independently, so I would like feedback on code structure, code cleanliness, and OOP design (or even whether OOP is the best approach here, as it's my first time thinking about using OOP in projects).

Please could someone review my code and suggest areas for improvement?

import re
import string

import requests
from bs4 import BeautifulSoup


class Article:
    def __init__(self, title, body):
        self.__title = title
        self.__format_title()
        self.__body = body.strip()

    @property
    def title(self):
        return self.__title

    @property
    def body(self):
        return self.__body

    def __format_title(self):
        return re.sub(f'[{string.punctuation}]', '', self.__title).strip().replace(' ', '_')


class ArticleScraper:
    def __init__(self, url):
        self.url = url
        self.response = None
        self.__soup = None
        self.__articles = []

    @property
    def articles(self):
        return self.__articles

    def save(self):
        self.__scrape()
        self.__get_articles_from_soup()
        self.__write_articles_to_file()

    def __scrape(self):
        self.__make_request()
        self.__soup = BeautifulSoup(self.response.content, 'html.parser')

    def __get_articles_from_soup(self):
        articles = self.__soup.find_all('article')
        for article in articles:
            article_type = article.find('span', {'data-test': 'article.type'}).text.strip()
            if article_type == 'News':
                link = article.find('a', {'data-track-action': 'view article'}).get('href')
                if link:
                    article_url = f'https://www.nature.com{link}'
                    resp = requests.get(article_url)
                    resp.raise_for_status()
                    article_soup = BeautifulSoup(resp.content, 'html.parser')
                    article_title = article_soup.find('title').text
                    # cannot get article body as nature.com now behind paywall, get teaser text instead
                    article_body = article_soup.find('div', {'class': 'c-article-teaser-text'}).text
                    self.__articles.append(Article(article_title, article_body))

    def __write_articles_to_file(self):
        for article in self.__articles:
            with open(f'{article.title}.txt', 'w', encoding='utf-8') as f:
                f.write(article.body)

    def __make_request(self):
        self.response = requests.get(self.url, headers={'Accept-Language': 'en-US,en;q=0.5'})
        self.response.raise_for_status()


if __name__ == '__main__':
    url_to_request = "https://www.nature.com/nature/articles?sort=PubDate&year=2020&page=3"
    webpage_scraper = ArticleScraper(url_to_request)
    try:
        webpage_scraper.save()
        print('Saved articles:')
        [print(f'{article.title}.txt') for article in webpage_scraper.articles]
    except requests.HTTPError:
        print(f'The URL returned {webpage_scraper.response.status_code}')
asked Mar 9, 2023 at 17:39

1 Answer

    def __init__(self, title, body):
        self.__title = title
        self.__format_title()
        ...

    def __format_title(self):
        return re.sub(f'[{string.punctuation}]', '', self.__title).strip().replace(' ', '_')

This doesn't make sense. There are no unit tests in this submission, which probably corresponds to no unit tests in the original source. A simple test would have caught the omitted assignment immediately.

The format-title helper is nice. (Well, fine, its name should start with just a single _ underscore, but never mind.) But then the caller fails to assign its very nice return value. You formatted the title, and then discarded the result.
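A sketch of the fix, plus the one-line test that would have caught the bug (the module-level helper and the test function are my additions for illustration, not the original code):

```python
import re
import string


def format_title(title):
    # drop punctuation, trim whitespace, replace spaces with underscores
    return re.sub(f'[{re.escape(string.punctuation)}]', '', title).strip().replace(' ', '_')


class Article:
    def __init__(self, title, body):
        self.title = format_title(title)  # assign the result instead of discarding it
        self.body = body.strip()


def test_title_is_formatted():
    article = Article('Hello, world!', ' some body text ')
    assert article.title == 'Hello_world'
    assert article.body == 'some body text'
```

Pulling the formatter out to module level also makes it trivially testable on its own, with no `Article` instance required.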

More generally, there are a bunch of other identifiers here that start with a double underscore where we would expect a single one, as described in PEP 8.

It's unclear why we have title / body getters. Please understand that Python is not Java. We're all grownups here. Use zero or one leading underscores to describe whether an attribute is public or private.
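In sketch form (the names here are illustrative):

```python
class Article:
    def __init__(self, title, body):
        self.title = title         # public: callers read it directly, no getter needed
        self._body = body.strip()  # one leading underscore: private by convention


article = Article('Some headline', '  preview text  ')
# plain attribute access, no @property boilerplate
assert article.title == 'Some headline'
```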


Kudos on choosing a noun, ArticleScraper, to name the class.

And of course, we usually prefer verbs for method names.


In the ArticleScraper constructor, kudos on introducing four attributes; that's a helpful directory that aids the reader in understanding the rest of the class.

Again, the articles getter appears to be pointless. It's not like it returns a .copy() or anything. Delete it.
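If you did want a getter that earns its keep, it could return a defensive copy so callers cannot mutate the internal list (a sketch, not the original design):

```python
class ArticleScraper:
    def __init__(self):
        self._articles = []

    @property
    def articles(self):
        # return a copy: callers see the data but cannot mutate our state
        return self._articles.copy()


scraper = ArticleScraper()
scraper.articles.append('mutation attempt')  # only mutates the copy
assert scraper.articles == []                # internal list is untouched
```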

You had good instincts here (especially when you followed the raise for status idiom), but the details didn't work out.

    def __scrape(self):
        self.__make_request()
        self.__soup = BeautifulSoup(self.response.content, 'html.parser')

    def __make_request(self):
        self.response = requests.get(self.url, headers={'Accept-Language': 'en-US,en;q=0.5'})
        self.response.raise_for_status()

Prefer to phrase the helper this way:

    @staticmethod
    def _make_request(url):
        response = requests.get(url, ...)
        response.raise_for_status()
        return response

And then the caller assigns

    self.response = self._make_request(self.url)

This carries several benefits. There is less coupling, the helper is simpler to understand, and we can more easily write unit tests for it.
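To make the testability benefit concrete, here is one sketch. It injects the getter as a parameter rather than mock-patching `requests.get`, so it runs offline; the names and the `FakeResponse` class are illustrative:

```python
class FakeResponse:
    # stand-in for requests.Response, so the test needs no network
    def __init__(self):
        self.status_checked = False

    def raise_for_status(self):
        self.status_checked = True


def make_request(url, getter):
    # 'getter' is injected (in production, requests.get); the helper has
    # no hidden state, so a test can substitute a fake in one line
    response = getter(url, headers={'Accept-Language': 'en-US,en;q=0.5'})
    response.raise_for_status()
    return response


fake = FakeResponse()
result = make_request('https://example.com', lambda url, headers: fake)
assert result is fake
assert fake.status_checked  # raise_for_status was called
```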

~~The JPEG image quality setting, ;q=0.5, doesn't make sense here.~~ (Sorry, my bad; please see the correction in the comments.)


 [print(f'{article.title}.txt') for article in webpage_scraper.articles]

Here you evaluate a list comprehension for its side effects and discard the resulting list, which is not a clear way to express the author's intent. Please don't do that. Follow the usual idiom:

    for article in webpage_scraper.articles:
        print(...)

Overall?

This submission contained zero unit tests, and in its current state it does not appear amenable to writing simple test methods that exercise the existing helpers. Zero tests don't boost our confidence in the correctness of this code base.

For such a small amount of code, there is a lot of coupling going on here. Entirely too much. I found save() much too hard to read: it takes zero formal parameters, passing everything around through object attributes, which in this context are essentially global variables. Everything is evaluated for side effects. It felt like the next line of code I read was bound to be a Fortran COMMON statement.

You're keeping functions small; that is good. You're trying to adhere to making them do one thing, also good. The next thing you should try practising is the notion of passing in one or two relevant parameters and getting back a result which you assign (or pass to next helper). It will improve the clarity of your code, of your intent.
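The parameter-in, result-out style can be sketched like this (a toy pipeline with illustrative names; a regex stands in for BeautifulSoup so the example runs offline):

```python
import re


def extract_titles(html):
    # pure function: text in, list of titles out; trivially testable
    return re.findall(r'<title>(.*?)</title>', html)


def scrape(url, fetch):
    # fetch is a parameter (in production, a wrapper around requests.get),
    # so there is no hidden state: each step receives its input and
    # returns its result to the next step
    html = fetch(url)
    return extract_titles(html)


titles = scrape('https://example.com/news',
                lambda url: '<title>First</title><title>Second</title>')
assert titles == ['First', 'Second']
```

Each function can now be exercised in isolation, which is exactly what the missing unit tests would need.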

In its current state, I would not be willing to delegate or accept code maintenance tasks for this codebase.

answered Mar 10, 2023 at 0:43
  • q=0.5 is nothing to do with compression in that context - it's used as a content negotiation parameter, telling the Web server that the generic English version is half as valuable as any American-English equivalent it might be offering. Commented Mar 10, 2023 at 7:20
