
Below is an exercise project I was doing on a certain educational site. It is supposed to fetch the HTML of a given URL (a static address in this example), search it for articles of a given type, and then save their text content to disk.

It works fine, but since there are no code reviews on that site, I'm somewhat worried that I'm doing something wrong without realizing it. Specifically, I'm struggling to understand in which cases, and to what end, I should use classes in Python.

What I want to ask is whether the code below is a valid application of OOP (and whether an OOP approach should even be used for such a task), and if not, why, and what I should be looking to improve.

import string
import os
import requests
from dataclasses import dataclass, field
from typing import ClassVar
from bs4 import BeautifulSoup as beauty

@dataclass
class WebScraper:
    CACHE: ClassVar[dict] = dict()
    BASE_URL: ClassVar[str] = 'https://www.nature.com'
    PAGE_URL: ClassVar[str] = 'https://www.nature.com/nature/articles?sort=PubDate&year=2020'

    page_number: int
    art_type: str
    raw_text: str = field(init=False)
    art_urls: dict = field(init=False)
    page_content: dict = field(init=False)

    @classmethod
    def save_content(cls):
        for page, articles in cls.CACHE.items():
            os.mkdir(page)
            for name, text in articles.items():
                path = os.path.join(page, f'{name}.txt')
                with open(path, 'wt', encoding='utf-8') as file:
                    file.write(text)

    def __post_init__(self):
        self.page_content = dict()
        url = self.PAGE_URL + f"&page={self.page_number}"
        self.raw_text = requests.get(url, headers={'Accept-Language': 'en-US,en;q=0.5'}).text
        self.art_urls = dict()
        self.parse_page()

    def parse_page(self):
        soup = beauty(self.raw_text, 'html.parser')
        articles = soup.find_all('article')
        for art in articles:
            span = art.find('span', {'class': 'c-meta__type'})
            if span.text == self.art_type:
                link = art.find('a')
                url = self.BASE_URL + link['href']
                self.art_urls[link.text] = url

    def parse_articles(self):
        for name, url in self.art_urls.items():
            r = requests.get(url, headers={'Accept-Language': 'en-US,en;q=0.5'})
            soup = beauty(r.text, 'html.parser')
            article = soup.find('div', {'class': 'c-article-body'})
            file_name = ''.join(ch if ch not in string.punctuation else '' for ch in name)
            file_name = file_name.replace(' ', '_')
            self.page_content[file_name] = article.get_text()

    def cache_content(self):
        self.CACHE[f'Page_{self.page_number}'] = self.page_content


n_of_pages = int(input())
art_type = input()

for page_num in range(1, n_of_pages + 1):
    scraper = WebScraper(page_num, art_type)
    scraper.parse_articles()
    scraper.cache_content()

WebScraper.save_content()

asked Feb 11, 2022 at 19:00
  • It's Scraper, not Scrapper (commented Feb 11, 2022 at 22:17)
  • Okay, amended. :D (commented Feb 12, 2022 at 11:09)

1 Answer


beauty is not a conventional alias for BeautifulSoup; you're better off using the original.

Your cache mechanism is a little strange, and should probably be deleted. It's not as if you've shown an access pattern that warrants this.

The OOP design hasn't particularly captured the right things. The "object" you care about most is an article and there is no class for that in your implementation. Conversely, WebScraper isn't very useful as a class since it doesn't need to carry around any state, and is probably better as a small collection of free functions. These functions can handle depagination as expressed through a generator.

Note that save_content doesn't save the content: it only saves the journal abstracts.

The t in your 'wt' mode is the default for open and can be omitted.

You jump through a few hoops to derive a filename from the title that will be (probably) compatible with your filesystem. This ends up with the worst of all worlds: a title that doesn't look as good as the original, which still has no guarantee of filesystem compatibility. Instead, there's a perfectly good article ID as the last part of the URL that you can use. If you still did want to title-mangle for a filename, consider instead a regex sub.
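
For instance, here is a sketch of the regex alternative and of taking the ID from the URL (the pattern and names are my own, not from your code):

import re

def filename_from_title(title: str) -> str:
    # Collapse every run of non-word characters into a single underscore.
    return re.sub(r'\W+', '_', title).strip('_')

# Or, better, reuse the stable ID that ends every article URL:
url = 'https://www.nature.com/articles/s41586-020-03057-y'
article_id = url.rsplit('/', 1)[1]  # 's41586-020-03057-y'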

I don't find it useful to carry the concept of a page, which only has meaning in the web search interface, to your filesystem. You could just save these files flat.

The broader question is: why? When you pull an abstract and discard its DOM structure, the resulting text has lost all of its formatting, including some fairly important superscript citations, etc. Why not just save the HTML? You could save an index csv with the additional fields that I've demonstrated you can parse, and then one .html per article.
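
A rough sketch of that layout, assuming an iterable of the Article tuples from the suggested code below and an open requests.Session (the file names and CSV columns here are my own choices):

import csv
from pathlib import Path

def save_articles(articles, session, out_dir: Path) -> None:
    # One .html per article, named by its stable ID, plus a CSV index of the
    # metadata fields that the listing page already provides.
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / 'index.csv', 'w', encoding='utf-8', newline='') as index:
        writer = csv.writer(index)
        writer.writerow(('id', 'title', 'article_type', 'date', 'url'))
        for article in articles:
            writer.writerow((article.id, article.title, article.article_type,
                             article.date.isoformat(), article.url))
            with session.get(article.url, headers={'Accept': 'text/html'}) as resp:
                resp.raise_for_status()
                (out_dir / f'{article.id}.html').write_text(resp.text, encoding='utf-8')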

When you're able, use a soup strainer to constrain the parse breadth.
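
In miniature (the suggested code below does the same with more targeted strainers; the tag name here is only illustrative):

from bs4 import BeautifulSoup
from bs4.element import SoupStrainer

html = '<nav>skip me</nav><article><h3>keep me</h3></article>'
# Only <article> subtrees are parsed; the rest of the document is discarded early.
soup = BeautifulSoup(markup=html, features='html.parser', parse_only=SoupStrainer('article'))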

The Accept-Language that you pass is less important (and is ignored by the server); the more important Accept should be passed which says that you only understand text/html.

Don't input() without passing a prompt to give the user a clue as to why your program mysteriously hung.
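
Something as small as this is enough (the prompt wording is arbitrary):

n_of_pages = int(input('Number of pages to scrape: '))
art_type = input('Article type (e.g. News): ')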

Suggested

This does not include file saving.

from datetime import date
from textwrap import wrap
from typing import NamedTuple, Iterator, Optional
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from bs4.element import SoupStrainer, Tag
from requests import Session

BASE_URL = 'https://www.nature.com'
SEARCH_STRAINER = SoupStrainer('ul', class_='app-article-list-row')
ARTICLE_STRAINER = SoupStrainer('div', id='Abs1-content')


class Article(NamedTuple):
    title: str
    authors: tuple[str, ...]
    article_type: str
    date: date
    url: str
    description: Optional[str]

    @classmethod
    def from_tag(cls, tag: Tag) -> 'Article':
        anchor = tag.find('a', class_='c-card__link')
        authors = tuple(
            span.text
            for span in tag.find('span', itemprop='name')
        )
        article_type = tag.find('span', class_='c-meta__type').text
        article_date = date.fromisoformat(
            tag.find('time')['datetime']
        )
        description = tag.find('div', itemprop='description')
        return cls(
            title=anchor.text,
            url=urljoin(BASE_URL, anchor['href']),
            authors=authors,
            article_type=article_type,
            date=article_date,
            description=description and description.text.strip(),
        )

    @property
    def id(self) -> str:
        return self.url.rsplit('/', 1)[1]


def get_abstract(session: Session, url: str) -> str:
    with session.get(
        url=url,
        headers={'Accept': 'text/html'},
    ) as resp:
        resp.raise_for_status()
        html = resp.text
    doc = BeautifulSoup(markup=html, features='html.parser', parse_only=ARTICLE_STRAINER)
    return doc.text


def articles_from_html(html: str) -> Iterator['Article']:
    doc = BeautifulSoup(markup=html, features='html.parser', parse_only=SEARCH_STRAINER)
    for tag in doc.find_all('article'):
        yield Article.from_tag(tag)


def scrape_one(
    session: Session,
    article_type: str,
    year: int,
    page: int,
    sort: str = 'PubDate',
) -> Iterator[Article]:
    params = {
        'searchType': 'journalSearch',
        'sort': sort,
        'type': article_type,
        'year': year,
        'page': page,
    }
    with session.get(
        url=urljoin(BASE_URL, '/nature/articles'),
        headers={'Accept': 'text/html'},
        params=params,
    ) as resp:
        resp.raise_for_status()
        html = resp.text
    yield from articles_from_html(html)


def scrape(
    session: Session,
    article_type: str,
    year: int,
    pages: int,
    sort: str = 'PubDate',
) -> Iterator[Article]:
    for page in range(1, pages + 1):
        yield from scrape_one(session, article_type, year, page, sort)


def test() -> None:
    with Session() as session:
        for article in scrape(
            session=session,
            article_type='article',
            year=2020,
            pages=2,
        ):
            print(article)
            print('\n'.join(wrap(
                get_abstract(session, article.url)
            )))
            print()


if __name__ == '__main__':
    test()

Output

Article(title='Nociceptive nerves regulate haematopoietic stem cell mobilization', authors=('Xin Gao',), article_type='Article', date=datetime.date(2020, 12, 23), url='https://www.nature.com/articles/s41586-020-03057-y', description='Stimulation of pain-sensing neurons, which can be achieved in mice by the ingestion of capsaicin, promotes the migration of haematopoietic stem cells from the bone marrow into the blood.')
Haematopoietic stem cells (HSCs) reside in specialized
microenvironments in the bone marrow—often referred to as
‘niches’—that represent complex regulatory milieux influenced by
multiple cellular constituents, including nerves1,2. Although
sympathetic nerves are known to regulate the HSC niche3,4,5,6, the
contribution of nociceptive neurons in the bone marrow remains
unclear. Here we show that nociceptive nerves are required for
enforced HSC mobilization and that they collaborate with sympathetic
nerves to maintain HSCs in the bone marrow. Nociceptor neurons drive
granulocyte colony-stimulating factor (G-CSF)-induced HSC mobilization
via the secretion of calcitonin gene-related peptide (CGRP). Unlike
sympathetic nerves, which regulate HSCs indirectly via the niche3,4,6,
CGRP acts directly on HSCs via receptor activity modifying protein 1
(RAMP1) and the calcitonin receptor-like receptor (CALCRL) to promote
egress by activating the Gαs/adenylyl cyclase/cAMP pathway. The
ingestion of food containing capsaicin—a natural component of chili
peppers that can trigger the activation of nociceptive
neurons—significantly enhanced HSC mobilization in mice. Targeting the
nociceptive nervous system could therefore represent a strategy to
improve the yield of HSCs for stem cell-based therapeutic agents.
...
answered Feb 15, 2022 at 2:24
  • Thank you very much for such a detailed response. Some elements of your solution went straight over my head (I've never used Session or yield before, and I don't understand how the from_tag method's return type can be the string 'Article'), but I'm looking forward to deciphering it all. :) (commented Feb 16, 2022 at 16:38)
  • This comments section isn't the right place to explain those topics, but I'm happy to help in chat.stackexchange.com/rooms/134182/… (commented Feb 16, 2022 at 16:41)
