I have created a simple web scraper that fetches news article previews from nature.com and saves each article to a file containing the article preview text.
I am learning independently, so I would like feedback on code structure, code cleanliness, and OOP design (or even whether OOP is the best approach here, as it's my first time thinking about using OOP in projects).
Please could someone review my code and suggest areas for improvement?
import re
import string

import requests
from bs4 import BeautifulSoup


class Article:
    def __init__(self, title, body):
        self.__title = title
        self.__format_title()
        self.__body = body.strip()

    @property
    def title(self):
        return self.__title

    @property
    def body(self):
        return self.__body

    def __format_title(self):
        return re.sub(f'[{string.punctuation}]', '', self.__title).strip().replace(' ', '_')


class ArticleScraper:
    def __init__(self, url):
        self.url = url
        self.response = None
        self.__soup = None
        self.__articles = []

    @property
    def articles(self):
        return self.__articles

    def save(self):
        self.__scrape()
        self.__get_articles_from_soup()
        self.__write_articles_to_file()

    def __scrape(self):
        self.__make_request()
        self.__soup = BeautifulSoup(self.response.content, 'html.parser')

    def __get_articles_from_soup(self):
        articles = self.__soup.find_all('article')
        for article in articles:
            article_type = article.find('span', {'data-test': 'article.type'}).text.strip()
            if article_type == 'News':
                link = article.find('a', {'data-track-action': 'view article'}).get('href')
                if link:
                    article_url = f'https://www.nature.com{link}'
                    resp = requests.get(article_url)
                    resp.raise_for_status()
                    article_soup = BeautifulSoup(resp.content, 'html.parser')
                    article_title = article_soup.find('title').text
                    # cannot get article body as nature.com now behind paywall, get teaser text instead
                    article_body = article_soup.find('div', {'class': 'c-article-teaser-text'}).text
                    self.__articles.append(Article(article_title, article_body))

    def __write_articles_to_file(self):
        for article in self.__articles:
            with open(f'{article.title}.txt', 'w', encoding='utf-8') as f:
                f.write(article.body)

    def __make_request(self):
        self.response = requests.get(self.url, headers={'Accept-Language': 'en-US,en;q=0.5'})
        self.response.raise_for_status()


if __name__ == '__main__':
    url_to_request = "https://www.nature.com/nature/articles?sort=PubDate&year=2020&page=3"
    webpage_scraper = ArticleScraper(url_to_request)
    try:
        webpage_scraper.save()
        print('Saved articles:')
        [print(f'{article.title}.txt') for article in webpage_scraper.articles]
    except requests.HTTPError:
        print(f'The URL returned {webpage_scraper.response.status_code}')
1 Answer
def __init__(self, title, body):
    self.__title = title
    self.__format_title()
    ...

def __format_title(self):
    return re.sub(f'[{string.punctuation}]', '', self.__title).strip().replace(' ', '_')
This doesn't make sense. There are no unit tests in this submission, which probably corresponds to no unit tests in the original source. A simple test would have immediately caught the omitted assignment.

The format-title helper is nice. (Well, fine, its name should start with just a single _ underscore, whatever.) But then the caller fails to assign its very nice return value. You formatted the title, and then discarded the result.
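For the record, a sketch of the corrected constructor: it assigns the helper's return value, and makes the helper a staticmethod so it's trivially unit-testable in isolation (the single-underscore rename is my suggestion, not the original author's code):

```python
import re
import string


class Article:
    def __init__(self, title, body):
        # Assign the helper's return value instead of discarding it.
        self._title = self._format_title(title)
        self._body = body.strip()

    @staticmethod
    def _format_title(title):
        # Strip punctuation and whitespace, then make the title filename-safe.
        return re.sub(f'[{string.punctuation}]', '', title).strip().replace(' ', '_')


# A one-line check would have caught the original bug immediately:
assert Article('Hello, World!', 'x')._title == 'Hello_World'
```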
More generally, there are a bunch of other identifiers here which start with a double underscore where we expect a single one, as described in PEP 8.

It's unclear why we have title / body getters. Please understand that Python is not Java. We're all grownups here. Use zero or one leading underscores to describe whether an attribute is public or private.
Kudos on choosing a noun, ArticleScraper, to name the class. And of course, we usually prefer verbs for method names.

In the ArticleScraper constructor, kudos on introducing four attributes; that's a helpful directory that aids the reader in understanding the rest of the class.

Again, the articles getter appears to be pointless. It's not like it returns a .copy() or anything. Delete it.
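As an illustrative sketch (one option among several), the whole Article class could shrink to a dataclass with plain public attributes and no getters:

```python
import re
import string
from dataclasses import dataclass


def _format_title(title):
    # Same formatting rule as the original helper, as a free function.
    return re.sub(f'[{string.punctuation}]', '', title).strip().replace(' ', '_')


@dataclass
class Article:
    title: str
    body: str

    def __post_init__(self):
        # Normalize once at construction; attributes stay directly readable.
        self.title = _format_title(self.title)
        self.body = self.body.strip()
```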
You had good instincts here (especially in following the raise_for_status idiom), but the details didn't work out.
def __scrape(self):
    self.__make_request()
    self.__soup = BeautifulSoup(self.response.content, 'html.parser')

def __make_request(self):
    self.response = requests.get(self.url, headers={'Accept-Language': 'en-US,en;q=0.5'})
    self.response.raise_for_status()
Prefer to phrase the helper this way:
@staticmethod
def _make_request(url):
    response = requests.get(url, ...)
    response.raise_for_status()
    return response

And then the caller assigns

self.response = self._make_request(self.url)
This carries several benefits. There is less coupling, the helper is simpler to understand, and we can more easily write unit tests for it.
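For example, here is a sketch of such a test using unittest.mock; the helper signature follows the suggestion above, and no network access is needed:

```python
from unittest import mock

import requests


class ArticleScraper:
    @staticmethod
    def _make_request(url):
        response = requests.get(url, headers={'Accept-Language': 'en-US,en;q=0.5'})
        response.raise_for_status()
        return response


def test_make_request_propagates_http_errors():
    # Stub out the network call entirely: the fake response raises on
    # raise_for_status, and the helper should let that error propagate.
    fake = mock.Mock()
    fake.raise_for_status.side_effect = requests.HTTPError('404')
    with mock.patch('requests.get', return_value=fake):
        try:
            ArticleScraper._make_request('https://example.com')
        except requests.HTTPError:
            pass  # expected
        else:
            raise AssertionError('expected HTTPError to propagate')
```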
(deleted) The JPEG image quality setting, ;q=0.5, doesn't make sense here. (Sorry, my bad; please see the correction in the comments.)
[print(f'{article.title}.txt') for article in webpage_scraper.articles]
Here you evaluate a list comprehension for its side effects and discard the resulting list, which is not a clear way to express the author's intent. Please don't do that. Follow the usual idiom:
for article in webpage_scraper.articles:
    print(...)
Overall?
This submission contained zero unit tests, and in its current state it does not appear to be amenable to writing simple test methods that exercise the existing helpers. Zero tests doesn't boost our confidence in the correctness of this code base.
For a small amount of code, I have to say there is a lot of coupling going on here. Entirely too much. I found it much too hard to read save(), as it uses zero formal parameters, passing things around through object attributes, which in this context are essentially global variables. Everything is evaluated for side effects. It felt like the next line of code I read was bound to be a Fortran COMMON statement.
You're keeping functions small; that is good. You're trying to make each function do one thing; also good. The next thing you should practise is passing in one or two relevant parameters and getting back a result which you assign (or pass to the next helper). It will improve the clarity of your code and of your intent.
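As a sketch of that style (the function names here are illustrative, not prescriptive), the top level reads as a pipeline in which each value is passed in explicitly and a result comes back:

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://www.nature.com'


def fetch_html(url):
    # One job: URL in, HTML bytes out.
    response = requests.get(url, headers={'Accept-Language': 'en-US,en;q=0.5'})
    response.raise_for_status()
    return response.content


def news_links(html):
    # One job: listing HTML in, list of absolute article URLs out.
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for article in soup.find_all('article'):
        kind = article.find('span', {'data-test': 'article.type'})
        if kind is None or kind.text.strip() != 'News':
            continue
        anchor = article.find('a', {'data-track-action': 'view article'})
        if anchor and anchor.get('href'):
            links.append(f"{BASE_URL}{anchor.get('href')}")
    return links


def scrape(listing_url):
    # The data flow is explicit: each helper's result feeds the next call.
    return news_links(fetch_html(listing_url))
```

Because news_links takes plain HTML and returns a plain list, it can be unit-tested with a hard-coded snippet and no network at all.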
In its current state, I would not be willing to delegate or accept code maintenance tasks for this codebase.
Comment: q=0.5 is nothing to do with compression in that context - it's used as a content negotiation parameter, telling the Web server that the generic English version is half as valuable as any American-English equivalent it might be offering. – Toby Speight, Mar 10, 2023 at 7:20