Scraping next page using BeautifulSoup

Question 1

I have created a script for article scraping - it finds title, subtitle, href-link, and the time of publication. Once retrieved, information is converted to a pandas dataframe, and the link for the next page is returned as well (so that it parses page after page).

Everything works as expected, though I feel there should be an easier (or more elegant) way of loading a subsequent page within the main function.

import requests
import pandas as pd
from bs4 import BeautifulSoup
from time import sleep


def read_page(url):
    r = requests.get(url)
    return BeautifulSoup(r.content, "lxml")


def news_scraper(soup):
    BASE = "https://www.pravda.com.ua"
    container = []
    for i in soup.select("div.news.news_all > div"):
        container.append(
            [
                i.a.text,  # title
                i.find(class_="article__subtitle").text,  # subtitle
                i.div.text,  # time
                BASE + i.a["href"],  # link
            ]
        )
    dataframe = pd.DataFrame(container, columns=["title", "subtitle", "time", "link"])
    dataframe["date"] = (
        dataframe["link"]
        .str.extract("(\d{4}/\d{2}/\d{2})")[0]
        .str.cat(dataframe["time"], sep=" ")
    )
    next_page = soup.select_one("div.archive-navigation > a.button.button_next")["href"]
    return dataframe.drop("time", axis=1), BASE + next_page


def main(START_URL):
    print(START_URL)
    results = []
    soup = read_page(START_URL)
    df, next_page = news_scraper(soup)
    results.append(df)
    while next_page:
        print(next_page)
        try:
            soup = read_page(next_page)
            df, next_page = news_scraper(soup)
            results.append(df)
        except:
            next_page = False
        sleep(1)
    return pd.concat([r for r in results], ignore_index=True)


if __name__ == "__main__":
    df = main("https://www.pravda.com.ua/archives/date_24122019/")
    assert df.shape == (120, 4)  # it's true as of today, 12.26.2019

Question 2

Why do you have a sleep(1) in the for loop?

Question 3

Note that the last page in the sequence ends with an <a href="/archives/" class="button button_next">...</a> link, which assigns /archives/ to next_page. How are you handling that case?

Question 4

@Zchpyvr Since I want to scrape a lot of pages, I thought that making a pause between requests would be necessary to avoid getting banned

Question 5

@RomanPerekhrest Regarding archives, I didn't think it all through. Since /archives would throw an error if I tried to scrape it with the news_scraper function, I thought I'd just stop at that point and set next_page to False to quit the while loop

Question 6

Optimization and restructuring

Function's responsibility

The initial approach makes the read_page function depend on both the requests and BeautifulSoup modules, although fetching a page does not need any BeautifulSoup features. A soup instance is then passed to the news_scraper(soup) function.
To reduce dependencies, let read_page fetch the remote webpage and just return its contents, r.content. That also decouples news_scraper from soup instances: it can accept any markup content, which makes the function more general.

Namings

BASE = "https://www.pravda.com.ua" within the news_scraper function is essentially a local variable. Since it is really a constant, it should be moved to the top level and given a meaningful name: BASE_URL = "https://www.pravda.com.ua".

i is not a good name for a document element in for i in soup.select("div.news.news_all > div"). Better names are node, el, article ...

The main function is better renamed to news_to_df to reflect its actual purpose.
main(START_URL): don't give arguments uppercase names; it should be start_url.

Parsing news items and composing "date" value

As you parse webpages (HTML pages), consider specifying html.parser or html5lib instead of lxml when creating the BeautifulSoup instance: html.parser ships with Python, and html5lib parses pages the way a browser does, though it is slower than lxml.

Extracting the publication time with the generic i.div.text is fragile, as the parent node div.article could contain other child div nodes with text content. The search query should therefore be more exact: news_time = el.find(class_='article__time').text.
Instead of assigning, traversing and dropping the "time" column and aggregating:

    dataframe["date"] = (
        dataframe["link"]
        .str.extract("(\d{4}/\d{2}/\d{2})")[0]
        .str.cat(dataframe["time"], sep=" ")
    )

all of that can be eliminated: the date column can be computed in one step by combining the extracted date value (using the precompiled regex pattern DATE_PAT = re.compile(r'\d{4}/\d{2}/\d{2}')) with the news_time value.
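The date composition described here can be tried in isolation. The following is a minimal sketch, assuming a helper named compose_date (a hypothetical name) and a made-up article path; returning None for a link without a date segment is an assumption of this sketch, not something the original code does.

```python
import re

# Compiled once at import time and reused for every link.
DATE_PAT = re.compile(r'\d{4}/\d{2}/\d{2}')


def compose_date(href, news_time):
    """Join the yyyy/mm/dd segment of an article link with its publication time.

    Returns None when the link carries no date segment (assumption of this sketch).
    """
    match = DATE_PAT.search(href)
    if match is None:
        return None
    return f'{match.group()} {news_time}'


# '/news/2019/12/24/0000000/' is a made-up path used only for illustration.
print(compose_date('/news/2019/12/24/0000000/', '23:36'))
print(compose_date('/archives/', '23:36'))
```

Guarding against a missing match matters because DATE_PAT.search returns None for links without a date, and calling .group() on None raises AttributeError.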

Instead of collecting a list of lists, a more robust way is to collect a list of dictionaries like {'title': ..., 'subtitle': ..., 'date': ..., 'link': ...}, as that prevents mixing up the order of values against a fixed list of column names.

Furthermore, instead of appending to a list, the sequence of dictionaries can be produced lazily by a generator function. See the full implementation below.
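To illustrate why keyed records are safer than positional lists, here is a small standard-library sketch. It uses csv.DictWriter as a stand-in for pandas.DataFrame so that it runs without third-party packages; collect_rows, rows_to_csv and the input keys (head, sub, when, url) are hypothetical names chosen for this example.

```python
import csv
import io

COLUMNS = ['title', 'subtitle', 'date', 'link']


def collect_rows(items):
    """Lazily yield one dictionary per item, keyed by column name."""
    for item in items:
        # Keys are deliberately listed in a different order than COLUMNS:
        # values are matched by name, so the order here does not matter.
        yield {'link': item['url'],
               'date': item['when'],
               'title': item['head'],
               'subtitle': item['sub']}


def rows_to_csv(rows):
    """Write the dictionaries as CSV text, with columns placed by name."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)  # consumes the generator one row at a time
    return buffer.getvalue()


sample = [{'head': 'T', 'sub': 'S', 'when': '2019/12/24 23:36',
           'url': 'https://example.com/a'}]
print(rows_to_csv(collect_rows(sample)))
```

With a list of lists, swapping two values silently puts them in the wrong columns; with dictionaries, each value travels with its column name.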

The main function (new name: news_to_df)

The while next_page: loop turns into while True: with explicit break statements.

except: - do not use a bare except, which also traps KeyboardInterrupt and SystemExit; at least catch the base Exception class: except Exception:.
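The difference between the two forms can be shown with a short sketch. The function names below (run_guarded, broken_parse, user_pressed_ctrl_c) are hypothetical and only simulate a parsing error and a Ctrl+C.

```python
def run_guarded(action):
    """Run action(); report ordinary errors but let interrupts propagate."""
    try:
        return action()
    except Exception as exc:
        # KeyboardInterrupt and SystemExit derive from BaseException,
        # so they are not caught by this clause.
        return f'stopped: {type(exc).__name__}'


def broken_parse():
    # Simulates subscripting the None returned by a failed select_one().
    raise TypeError("'NoneType' object is not subscriptable")


def user_pressed_ctrl_c():
    # Simulates the user interrupting a long scraping run.
    raise KeyboardInterrupt


print(run_guarded(broken_parse))

try:
    run_guarded(user_pressed_ctrl_c)
except KeyboardInterrupt:
    print('interrupt reached the caller')
```

With a bare except in run_guarded, the second call would be swallowed as well, and a long scraping loop could not be stopped cleanly from the keyboard.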

The repeated blocks of read_page, news_scraper and results.append(df) statements can be reduced to a single block (see below).
One subtle nuance is that on the last page the a.button.button_next link has '/archives/' as its href, signaling the end of paging. It's worth handling that situation explicitly. Since news_scraper returns the link already prefixed with the base URL, the comparison has to use the full address:

    if next_page == f'{BASE_URL}/archives/':
        break
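The stop condition can be checked without network access. The sketch below is one possible shape for the paging loop, not the answer's implementation: crawl, LAST_PAGE_URL and FAKE_SITE are hypothetical names, and the fetching and parsing step is passed in as a function so that a dictionary can stand in for the real site.

```python
from time import sleep

BASE_URL = "https://www.pravda.com.ua"
LAST_PAGE_URL = f"{BASE_URL}/archives/"  # where the final "next" button points


def crawl(start_url, fetch_and_parse, delay=1.0):
    """Follow "next" links from start_url until the archive index is reached.

    fetch_and_parse(url) must return (rows, next_url); it is injected so the
    loop can be tried without any HTTP requests.
    """
    collected = []
    next_page = start_url
    while next_page != LAST_PAGE_URL:
        rows, next_page = fetch_and_parse(next_page)
        collected.extend(rows)
        if next_page != LAST_PAGE_URL:
            sleep(delay)  # pause only when another request follows
    return collected


# Hypothetical stand-in for the real site: two archive pages, then the index.
FAKE_SITE = {
    f"{BASE_URL}/archives/date_24122019/": (["a", "b"], f"{BASE_URL}/archives/date_25122019/"),
    f"{BASE_URL}/archives/date_25122019/": (["c"], LAST_PAGE_URL),
}

print(crawl(f"{BASE_URL}/archives/date_24122019/", FAKE_SITE.__getitem__, delay=0))
```

Putting the end-of-paging test in the while condition removes the need for a sentinel value such as False and makes the exit path independent of exceptions.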

The final optimized solution:

import requests
import pandas as pd
from bs4 import BeautifulSoup
from time import sleep
import re

BASE_URL = "https://www.pravda.com.ua"
DATE_PAT = re.compile(r'\d{4}/\d{2}/\d{2}')


def read_page(url):
    """Download a page and return its raw markup."""
    r = requests.get(url)
    return r.content


def _collect_newsitems_gen(articles):
    """Yield one dictionary per article node."""
    for el in articles:
        a_node = el.a
        news_time = el.find(class_='article__time').text
        yield {'title': a_node.text,
               'subtitle': el.find(class_="article__subtitle").text,
               'date': f'{DATE_PAT.search(a_node["href"]).group()} {news_time}',
               'link': f'{BASE_URL}{a_node["href"]}'}


def news_scraper(news_content):
    """Parse one archive page into a DataFrame and the absolute URL of the next page."""
    soup = BeautifulSoup(news_content, "html5lib")
    articles = soup.select("div.news.news_all > div")
    next_page_url = soup.select_one("div.archive-navigation > a.button.button_next")["href"]
    df = pd.DataFrame(list(_collect_newsitems_gen(articles)),
                      columns=["title", "subtitle", "date", "link"])
    return df, f'{BASE_URL}{next_page_url}'


def news_to_df(start_url):
    next_page = start_url
    results = []
    while True:
        print(next_page)
        # news_scraper returns absolute URLs, so compare against the full address
        if next_page == f'{BASE_URL}/archives/':
            break
        try:
            content = read_page(next_page)
            df, next_page = news_scraper(content)
            results.append(df)
        except Exception:
            break
        sleep(1)
    return pd.concat(results, ignore_index=True)


if __name__ == "__main__":
    df = news_to_df("https://www.pravda.com.ua/archives/date_24122019/")
    assert df.shape == (120, 4)  # it's true as of today, 12.26.2019

Printing the final df with print(df.to_string()) gives output like the following (the middle part is cut to keep it short):

https://www.pravda.com.ua/archives/date_24122019/
https://www.pravda.com.ua/archives/date_25122019/
https://www.pravda.com.ua/archives/
 title subtitle date link
0 Голова Закарпаття не зрозумів, за що його звіл... Голова Закарпатської обласної державної адміні... 2019/12/24 23:36 https://www.pravda.com.ua/news/2019/12/24/7235...
1 Стало відомо коли відновлять будівництво об'єк... На зустрічі представників керівництва ХК Київм... 2019/12/24 22:41 https://www.pravda.com.uahttps://www.epravda.c...
2 ВАКС продовжив арешт Гримчаку до 14 лютого Вищий антикорупційний продовжив арешт для коли... 2019/12/24 22:25 https://www.pravda.com.ua/news/2019/12/24/7235...
3 Економічні новини 24 грудня: транзит газу, зни... Про транзит газу, про зниження "платіжок", про... 2019/12/24 22:10 https://www.pravda.com.uahttps://www.epravda.c...
4 Трамп: США готові до будь-якого "різдвяного по... Президент США Дональд Трамп на тлі побоювань щ... 2019/12/24 22:00 https://www.pravda.com.uahttps://www.eurointeg...
5 У податковій слідчі дії – електронні сервіси п... Державна податкова служба попереджає, що елект... 2019/12/24 21:55 https://www.pravda.com.ua/news/2019/12/24/7235...
6 Мінфін знизив ставки за держборгом до 11% річних Міністерство фінансів знизило середньозважену ... 2019/12/24 21:31 https://www.pravda.com.uahttps://www.epravda.c...
7 Україна викреслила зі списку на обмін ексберку... Російський адвокат Валентин Рибін заявляє, що ... 2019/12/24 21:13 https://www.pravda.com.ua/news/2019/12/24/7235...
8 Посол: іспанський клуб покарають за образи укр... Посол України в Іспанії Анатолій Щерба заявив,... 2019/12/24 20:45 https://www.pravda.com.uahttps://www.eurointeg...
9 Міністр енергетики: "Газпром" може "зістрибнут... У Міністерстві енергетики не виключають, що "Г... 2019/12/24 20:03 https://www.pravda.com.uahttps://www.epravda.c...
10 Зеленський призначив Арахамію секретарем Націн... Президент Володимир Зеленський затвердив персо... 2019/12/24 20:00 https://www.pravda.com.ua/news/2019/12/24/7235...
...
110 Уряд придумав, як захистити українців від шкод... Кабінет міністрів схвалив законопроєкт, який з... 2019/12/25 06:54 https://www.pravda.com.ua/news/2019/12/25/7235...
111 Кіберполіція та YouControl домовилися про спів... Кіберполіція та компанія YouControl підписали ... 2019/12/25 06:00 https://www.pravda.com.ua/news/2019/12/25/7235...
112 В окупованому Криму продають прикарпатські яли... У центрі Сімферополя, на новорічному ярмарку п... 2019/12/25 05:11 https://www.pravda.com.ua/news/2019/12/25/7235...
113 У США схожий на Санту чоловік пограбував банк,... У Сполучених Штатах чоловік з білою, як у Сант... 2019/12/25 04:00 https://www.pravda.com.ua/news/2019/12/25/7235...
114 У Росії за "дитячу порнографію" посадили блоге... Верховний суд російської Чувашії засудив до тр... 2019/12/25 03:26 https://www.pravda.com.ua/news/2019/12/25/7235...
115 Уряд провів екстрене засідання через газові пе... Кабінет міністрів у вівторок ввечері провів ек... 2019/12/25 02:31 https://www.pravda.com.ua/news/2019/12/25/7235...
116 Нова стратегія Мінспорту: розвиток інфраструкт... Стратегія розвитку спорту і фізичної активност... 2019/12/25 02:14 https://www.pravda.com.ua/news/2019/12/25/7235...
117 Милованов розкритикував НБУ за курс гривні та ... Міністр розвитку економіки Тимофій Милованов р... 2019/12/24 01:46 https://www.pravda.com.uahttps://www.epravda.c...
118 Російські літаки розбомбили школу в Сирії: заг... Щонайменше 10 людей, в тому числі шестеро – ді... 2019/12/25 01:04 https://www.pravda.com.ua/news/2019/12/25/7235...
119 Ліквідація "майданчиків Яценка": Зеленський пі... Президент Володимир Зеленський підписав закон,... 2019/12/25 00:27 https://www.pravda.com.ua/news/2019/12/25/7235...
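Rows 1, 3 and 4 of the output above show links where the host appears twice (https://www.pravda.com.uahttps://www.epravda.c...), which suggests that some hrefs on the page are already absolute. A minimal sketch of one way to handle both cases, assuming urllib.parse.urljoin from the standard library replaces the string concatenation; absolute_link and the example.com address are hypothetical and used only for illustration:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.pravda.com.ua"


def absolute_link(href):
    """Resolve href against BASE_URL; absolute hrefs are returned unchanged."""
    return urljoin(BASE_URL, href)


# A site-relative path gets the host prepended.
print(absolute_link("/news/2019/12/24/"))
# An already absolute URL (made-up address) is left as it is.
print(absolute_link("https://example.com/news/"))
```

If this were adopted in the scraper, the affected rows in the link column would change compared with the output shown, so the printed result would no longer match line for line.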

P.S. From Ukraine with love ...

score 3 · Accepted Answer · 2019-12-26 22:57:25Z

Thanks so much, I learnt a lot. Дякую :)
– Hryhorii Pavlenko, commented Dec 27, 2019 at 8:36

@politicalscientist, you're welcome
– RomanPerekhrest, commented Dec 27, 2019 at 8:39
