I have created a simple web scraper that fetches news article previews from nature.com and saves each article to a file containing the article preview text.
I am learning independently, so I would like feedback on code structure, code cleanliness, and OOP design (or even whether OOP is the best approach here, as it's my first time thinking about using OOP in projects).
Please could someone review my code and suggest areas for improvement?
import re
import string

import requests
from bs4 import BeautifulSoup


class Article:
    def __init__(self, title, body):
        self.__title = title
        self.__format_title()
        self.__body = body.strip()

    @property
    def title(self):
        return self.__title

    @property
    def body(self):
        return self.__body

    def __format_title(self):
        return re.sub(f'[{string.punctuation}]', '', self.__title).strip().replace(' ', '_')


class ArticleScraper:
    def __init__(self, url):
        self.url = url
        self.response = None
        self.__soup = None
        self.__articles = []

    @property
    def articles(self):
        return self.__articles

    def save(self):
        self.__scrape()
        self.__get_articles_from_soup()
        self.__write_articles_to_file()

    def __scrape(self):
        self.__make_request()
        self.__soup = BeautifulSoup(self.response.content, 'html.parser')

    def __get_articles_from_soup(self):
        articles = self.__soup.find_all('article')
        for article in articles:
            article_type = article.find('span', {'data-test': 'article.type'}).text.strip()
            if article_type == 'News':
                link = article.find('a', {'data-track-action': 'view article'}).get('href')
                if link:
                    article_url = f'https://www.nature.com{link}'
                    resp = requests.get(article_url)
                    resp.raise_for_status()
                    article_soup = BeautifulSoup(resp.content, 'html.parser')
                    article_title = article_soup.find('title').text
                    # cannot get article body as nature.com now behind paywall, get teaser text instead
                    article_body = article_soup.find('div', {'class': 'c-article-teaser-text'}).text
                    self.__articles.append(Article(article_title, article_body))

    def __write_articles_to_file(self):
        for article in self.__articles:
            with open(f'{article.title}.txt', 'w', encoding='utf-8') as f:
                f.write(article.body)

    def __make_request(self):
        self.response = requests.get(self.url, headers={'Accept-Language': 'en-US,en;q=0.5'})
        self.response.raise_for_status()


if __name__ == '__main__':
    url_to_request = "https://www.nature.com/nature/articles?sort=PubDate&year=2020&page=3"
    webpage_scraper = ArticleScraper(url_to_request)
    try:
        webpage_scraper.save()
        print('Saved articles:')
        [print(f'{article.title}.txt') for article in webpage_scraper.articles]
    except requests.HTTPError:
        print(f'The URL returned {webpage_scraper.response.status_code}')
1 Answer
def __init__(self, title, body):
    self.__title = title
    self.__format_title()
    ...

def __format_title(self):
    return re.sub(f'[{string.punctuation}]', '', self.__title).strip().replace(' ', '_')
This doesn't make sense. There are no unit tests in this submission, which probably corresponds to no unit tests in the original source. A simple test would have immediately caught the omitted assignment.

The format-title helper is nice. (Well, fine, its name should start with just a single _ underscore, whatever.) But then the caller fails to assign its very nice return value. You formatted the title, and then discarded the result.
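For the record, a sketch of the corrected constructor: it assigns the helper's return value, and makes the helper a staticmethod so it's trivially unit-testable in isolation (the single-underscore rename is my suggestion, not the original author's code):

```python
import re
import string


class Article:
    def __init__(self, title, body):
        # Assign the helper's return value instead of discarding it.
        self._title = self._format_title(title)
        self._body = body.strip()

    @staticmethod
    def _format_title(title):
        # Strip punctuation and whitespace, then make the title filename-safe.
        return re.sub(f'[{string.punctuation}]', '', title).strip().replace(' ', '_')


# A one-line check would have caught the original bug immediately:
assert Article('Hello, World!', 'x')._title == 'Hello_World'
```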
More generally, there are a bunch of other identifiers here which start with a double underscore where we expect a single one, as described in PEP 8.

It's unclear why we have title / body getters. Please understand that Python is not Java. We're all grownups here. Use zero or one leading underscores to describe whether an attribute is public or private.
Kudos on choosing a noun, ArticleScraper, to name the class. And of course, we usually prefer verbs for method names.

In the ArticleScraper constructor, kudos on introducing four attributes; that's a helpful directory that aids the reader in understanding the rest of the class.

Again, the articles getter appears to be pointless. It's not like it returns a .copy() or anything. Delete it.
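As an illustrative sketch (one option among several), the whole Article class could shrink to a dataclass with plain public attributes and no getters:

```python
import re
import string
from dataclasses import dataclass


def _format_title(title):
    # Same formatting rule as the original helper, as a free function.
    return re.sub(f'[{string.punctuation}]', '', title).strip().replace(' ', '_')


@dataclass
class Article:
    title: str
    body: str

    def __post_init__(self):
        # Normalize once at construction; attributes stay directly readable.
        self.title = _format_title(self.title)
        self.body = self.body.strip()
```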
You had good instincts here (especially in following the raise_for_status idiom), but the details didn't work out.
def __scrape(self):
    self.__make_request()
    self.__soup = BeautifulSoup(self.response.content, 'html.parser')

def __make_request(self):
    self.response = requests.get(self.url, headers={'Accept-Language': 'en-US,en;q=0.5'})
    self.response.raise_for_status()
Prefer to phrase the helper this way:
@staticmethod
def _make_request(url):
    response = requests.get(url, ...)
    response.raise_for_status()
    return response

And then the caller assigns

self.response = self._make_request(self.url)
This carries several benefits. There is less coupling, the helper is simpler to understand, and we can more easily write unit tests for it.
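For example, here is a sketch of such a test using unittest.mock; the helper signature follows the suggestion above, and no network access is needed:

```python
from unittest import mock

import requests


class ArticleScraper:
    @staticmethod
    def _make_request(url):
        response = requests.get(url, headers={'Accept-Language': 'en-US,en;q=0.5'})
        response.raise_for_status()
        return response


def test_make_request_propagates_http_errors():
    # Stub out the network call entirely: the fake response raises on
    # raise_for_status, and the helper should let that error propagate.
    fake = mock.Mock()
    fake.raise_for_status.side_effect = requests.HTTPError('404')
    with mock.patch('requests.get', return_value=fake):
        try:
            ArticleScraper._make_request('https://example.com')
        except requests.HTTPError:
            pass  # expected
        else:
            raise AssertionError('expected HTTPError to propagate')
```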
(deleted) The JPEG image quality setting, ;q=0.5, doesn't make sense here. (Sorry, my bad; please see the correction in the comments.)
[print(f'{article.title}.txt') for article in webpage_scraper.articles]
Here you evaluate a list comprehension for its side effects and discard the resulting list, which is not a clear way to express the author's intent. Please don't do that. Follow the usual idiom:
for article in webpage_scraper.articles:
    print(...)
Overall?
This submission contained zero unit tests, and in its current state it does not appear to be amenable to writing simple test methods that exercise the existing helpers. Zero tests doesn't boost our confidence in the correctness of this code base.
For a small amount of code, I have to say there is a lot of coupling going on here. Entirely too much. I found it much too hard to read save(), as it uses zero formal parameters, passing things around through object attributes, which in this context are essentially global variables. Everything is evaluated for side effects. It felt like the next line of code I read was bound to be a Fortran COMMON statement.
You're keeping functions small; that is good. You're trying to make each function do one thing; also good. The next thing you should practise is passing in one or two relevant parameters and getting back a result which you assign (or pass to the next helper). It will improve the clarity of your code and of your intent.
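As a sketch of that style (the function names here are illustrative, not prescriptive), the top level reads as a pipeline in which each value is passed in explicitly and a result comes back:

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://www.nature.com'


def fetch_html(url):
    # One job: URL in, HTML bytes out.
    response = requests.get(url, headers={'Accept-Language': 'en-US,en;q=0.5'})
    response.raise_for_status()
    return response.content


def news_links(html):
    # One job: listing HTML in, list of absolute article URLs out.
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for article in soup.find_all('article'):
        kind = article.find('span', {'data-test': 'article.type'})
        if kind is None or kind.text.strip() != 'News':
            continue
        anchor = article.find('a', {'data-track-action': 'view article'})
        if anchor and anchor.get('href'):
            links.append(f"{BASE_URL}{anchor.get('href')}")
    return links


def scrape(listing_url):
    # The data flow is explicit: each helper's result feeds the next call.
    return news_links(fetch_html(listing_url))
```

Because news_links takes plain HTML and returns a plain list, it can be unit-tested with a hard-coded snippet and no network at all.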
In its current state, I would not be willing to delegate or accept code maintenance tasks for this codebase.
Comment: q=0.5 is nothing to do with compression in that context - it's used as a content negotiation parameter, telling the Web server that the generic English version is half as valuable as any American-English equivalent it might be offering. – Toby Speight, Mar 10, 2023 at 7:20