Pattern 1: Page number in the URL
The simplest pattern. The URL contains a page parameter — either as a query string (?page=2) or as part of the path (/catalogue/page-2.html). Increment it until you get a 404 or an empty result set.
import requests
from bs4 import BeautifulSoup

BASE = "https://books.toscrape.com/catalogue/"

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
})

all_books = []
page = 1

while True:
    url = f"{BASE}page-{page}.html"
    resp = session.get(url, timeout=15)
    if resp.status_code == 404:
        break  # past the last page

    resp.encoding = "utf-8"
    soup = BeautifulSoup(resp.text, "html.parser")
    books = soup.find_all("article", class_="product_pod")
    if not books:
        break  # empty 200: also done

    for book in books:
        all_books.append({
            "title": book.find("h3").find("a")["title"],
            "price": book.find("p", class_="price_color").text.strip(),
            # class list looks like ["star-rating", "Three"]
            "rating": book.find("p", class_="star-rating")["class"][1],
        })

    print(f"Page {page}: {len(books)} books")
    page += 1

print(f"\nTotal: {len(all_books)} books")
Two termination conditions, not one. Some sites return an empty 200 for out-of-range pages rather than a 404. Check both.
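To make the two stop conditions easy to see (and to test without hitting a real site), here is a minimal sketch that separates the loop from the fetching. The names iter_pages, fetch and max_pages are hypothetical, not from any library: fetch(page) is assumed to return None on a 404 and a possibly empty list otherwise, and max_pages is an added safety cap.

```python
from typing import Callable, Iterator, List, Optional


def iter_pages(fetch: Callable[[int], Optional[List[dict]]],
               max_pages: int = 1000) -> Iterator[List[dict]]:
    """Yield one list of items per page until the site runs out.

    `fetch(page)` is a hypothetical callable: it returns None for a 404
    and a (possibly empty) list of items for any other response.
    """
    for page in range(1, max_pages + 1):
        items = fetch(page)
        if items is None:  # termination 1: 404, past the last page
            return
        if not items:      # termination 2: 200 with nothing in it
            return
        yield items


# Stub standing in for a real site: three pages, then empty 200s.
def fake_fetch(page: int) -> Optional[List[dict]]:
    data = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}], 3: [{"id": 4}]}
    return data.get(page, [])


pages = list(iter_pages(fake_fetch))
total = sum(len(p) for p in pages)
print(f"{len(pages)} pages, {total} items")
```

In real use, fetch would wrap session.get with params={"page": page} for the query-string form, or build the path for the /page-N.html form.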
Pattern 2: Following the "next" link
A cleaner approach for HTML-paginated sites: let the page tell you where to go next, rather than constructing URLs yourself. Most paginated sites include a "Next" link in the HTML. Follow it until it disappears.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://books.toscrape.com/catalogue/"

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
})

url = "https://books.toscrape.com/catalogue/page-1.html"
all_books = []

while url:
    resp = session.get(url, timeout=15)
    resp.encoding = "utf-8"
    soup = BeautifulSoup(resp.text, "html.parser")

    for book in soup.find_all("article", class_="product_pod"):
        all_books.append({
            "title": book.find("h3").find("a")["title"],
            "price": book.find("p", class_="price_color").text.strip(),
            "rating": book.find("p", class_="star-rating")["class"][1],
        })

    # The last page has no "next" link, which ends the loop.
    next_btn = soup.select_one("li.next a")
    url = urljoin(BASE, next_btn["href"]) if next_btn else None

print(f"Scraped {len(all_books)} books")
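The urljoin(BASE, next_btn["href"]) call is worth noting. The href in a "next" link is often relative (page-2.html, ../page-2.html), and urljoin resolves it the way a browser would, whatever form the relative path takes. Concatenating strings instead breaks on paths such as ../page-2.html or /catalogue/page-2.html. One caveat: a relative href is relative to the page it appears on, so a fixed BASE is only correct while every listing page sits in the same directory, as it does here. If a site's pages live at different path depths, resolve against resp.url instead.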
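A quick way to see what urljoin does with different href forms is to run it on a few by hand. The hrefs below are illustrative examples rather than values scraped from the site, and the front-page case at the end is a hypothetical layout used to show why the choice of base matters.

```python
from urllib.parse import urljoin

page_url = "https://books.toscrape.com/catalogue/page-1.html"

# Relative to the current directory
same_dir = urljoin(page_url, "page-2.html")
# Up one level
parent = urljoin(page_url, "../index.html")
# Root-relative: replaces the whole path
root = urljoin(page_url, "/catalogue/page-3.html")
# Already absolute: returned unchanged
absolute = urljoin(page_url, "https://example.com/other.html")

# Naive concatenation leaves the ".." segment unresolved
naive = "https://books.toscrape.com/catalogue/" + "../index.html"

# Hypothetical front page whose "next" href is relative to the site root:
# joining against a fixed catalogue/ base doubles the directory,
# joining against the page's own URL does not.
front_url = "https://books.toscrape.com/index.html"
wrong = urljoin("https://books.toscrape.com/catalogue/", "catalogue/page-2.html")
right = urljoin(front_url, "catalogue/page-2.html")

for name, value in [("same_dir", same_dir), ("parent", parent), ("root", root),
                    ("absolute", absolute), ("naive", naive),
                    ("wrong", wrong), ("right", right)]:
    print(f"{name:9} {value}")
```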
Pattern 3: API cursor / continuation token
JSON APIs often paginate differently. Instead of page numbers, they return a token or flag telling you whether more results exist, and sometimes a cursor to pass back on the next request.
The simplest version: a has_next boolean.
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "application/json",
    "Accept-Language": "en-GB,en;q=0.9",
})

all_quotes = []
page = 1

while True:
    resp = session.get(
        "https://quotes.toscrape.com/api/quotes",
        params={"page": page},
        timeout=15,
    )
    resp.raise_for_status()
    data = resp.json()

    all_quotes.extend(data["quotes"])
    print(f"Page {page}: {len(data['quotes'])} quotes")

    if not data["has_next"]:
        break
    page += 1

print(f"\nTotal: {len(all_quotes)} quotes")
Some APIs use a cursor instead — the response includes a next_cursor or next_page_token field that you pass as a parameter on the subsequent request. The structure changes but the loop logic is the same: keep going until the cursor field is null or absent.
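# Generic cursor pattern (example.com and the field names are placeholders)
import requests

session = requests.Session()
items = []
params = {"limit": 100}

while True:
    resp = session.get("https://example.com/api/items", params=params, timeout=15)
    resp.raise_for_status()
    data = resp.json()
    items.extend(data["results"])

    cursor = data.get("next_cursor")
    if not cursor:
        break  # null or missing cursor: no more pages
    params["cursor"] = cursor  # send the cursor back on the next request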
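The same loop can be sketched as a generator that is independent of any particular API, which also makes it testable against a stub. Everything named here is an assumption for illustration: iter_cursor and fetch are hypothetical, and the results and next_cursor field names simply mirror the generic example. The sketch also stops if the server hands back a cursor it has already issued, a defensive extra that avoids looping forever on a misbehaving API.

```python
from typing import Any, Callable, Iterator, Optional


def iter_cursor(fetch: Callable[[Optional[str]], dict]) -> Iterator[Any]:
    """Yield items across pages; `fetch(cursor)` returns one decoded page."""
    cursor: Optional[str] = None
    seen = set()
    while True:
        data = fetch(cursor)
        yield from data["results"]
        cursor = data.get("next_cursor")
        if not cursor or cursor in seen:
            return  # finished, or the API repeated a cursor
        seen.add(cursor)


# Stub API: three pages chained by cursors "a" and "b".
pages = {
    None: {"results": [1, 2], "next_cursor": "a"},
    "a": {"results": [3], "next_cursor": "b"},
    "b": {"results": [4], "next_cursor": None},
}
items = list(iter_cursor(pages.__getitem__))

# Stub API that keeps returning the same cursor.
looping = {
    None: {"results": [1], "next_cursor": "a"},
    "a": {"results": [2], "next_cursor": "a"},
}
looped_items = list(iter_cursor(looping.__getitem__))

print(items, looped_items)
```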
Rate limiting
Sending requests as fast as the network allows is not scraping — it's a load test. Most sites will rate-limit or block traffic that arrives faster than a human could generate it. A 1-2 second delay between pages is a reasonable starting point; adjust based on the site's response times and any explicit rate-limit headers it sends.
import time

while url:
    resp = session.get(url, timeout=15)
    # ... process page ...
    next_btn = soup.select_one("li.next a")
    url = urljoin(BASE, next_btn["href"]) if next_btn else None
    if url:
        time.sleep(1)  # only sleep if there's another request coming
Sleeping after the last page is unnecessary. Put the sleep before the next request or, as above, after confirming there is a next request.
For higher-volume work, time.sleep with a fixed value is blunt. A better approach uses a random delay within a range — time.sleep(random.uniform(0.5, 2.0)) — which avoids the metronomic request timing that fixed delays produce.
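One possible way to combine the random delay with an explicit rate-limit header is a small helper like the following. It is a sketch under assumptions: next_delay is a hypothetical name, it only understands the numeric-seconds form of the standard Retry-After header (the HTTP-date form falls through to the random delay), and the 0.5 to 2.0 second range is the one suggested above rather than a recommendation for any specific site.

```python
import random
from typing import Mapping


def next_delay(headers: Mapping[str, str],
               low: float = 0.5, high: float = 2.0) -> float:
    """Return how long to wait before the next request, in seconds."""
    retry_after = headers.get("Retry-After", "")
    if retry_after.strip().isdigit():
        # The server asked for a specific pause: honour it.
        return float(retry_after)
    # Otherwise use a jittered delay to avoid metronomic timing.
    return random.uniform(low, high)


print(next_delay({"Retry-After": "30"}))
print(next_delay({}))
```

Inside a scraping loop this would be used as time.sleep(next_delay(resp.headers)) once you know another request is coming.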
Duplicate URL detection
Some sites have inconsistent pagination — "next" links that eventually loop back, or page parameters that wrap around. A simple seen_urls set catches this before it turns into an infinite loop:
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
})

BASE = "https://books.toscrape.com/catalogue/"
url = "https://books.toscrape.com/catalogue/page-1.html"
seen_urls = set()
all_books = []

while url:
    # Stop as soon as pagination points back at a page already visited.
    if url in seen_urls:
        print(f"Loop detected at {url}, stopping")
        break
    seen_urls.add(url)

    resp = session.get(url, timeout=15)
    resp.encoding = "utf-8"
    soup = BeautifulSoup(resp.text, "html.parser")

    for book in soup.find_all("article", class_="product_pod"):
        all_books.append({
            "title": book.find("h3").find("a")["title"],
            "price": book.find("p", class_="price_color").text.strip(),
            "rating": book.find("p", class_="star-rating")["class"][1],
        })

    next_btn = soup.select_one("li.next a")
    url = urljoin(BASE, next_btn["href"]) if next_btn else None

print(f"Scraped {len(all_books)} books across {len(seen_urls)} pages")
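A seen_urls set only catches loops when the same page always appears under the same string. If a site emits equivalent URLs that differ in fragment or query-parameter order, it may help to normalise before the membership check. The following is a rough sketch, not a complete URL canonicaliser: canonical is a hypothetical helper, and it deliberately leaves scheme, host and path untouched.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical(url: str) -> str:
    """Drop the fragment and sort query parameters."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Empty string in the last slot discards the #fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


a = canonical("https://example.com/list?page=2&sort=asc#top")
b = canonical("https://example.com/list?sort=asc&page=2")
c = canonical("https://books.toscrape.com/catalogue/page-1.html")
print(a, b, c, sep="\n")
```

With this in place, the loop would test canonical(url) in seen_urls and add canonical(url) rather than the raw string.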
Using Scrapy's CrawlSpider
If you're building on Scrapy, CrawlSpider handles link following automatically via rules. This is the idiomatic Scrapy approach for sites where pagination follows a consistent pattern:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BooksSpider(CrawlSpider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/catalogue/page-1.html"]

    rules = (
        # Follow "next page" links
        Rule(
            LinkExtractor(restrict_css="li.next a"),
            callback="parse_page",
            follow=True,
        ),
    )

    def parse_start_url(self, response):
        # Rule callbacks are not run for start_urls responses,
        # so route the first page through the same parser.
        return self.parse_page(response)

    def parse_page(self, response):
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(default="").strip(),
                "rating": book.css("p.star-rating::attr(class)").get(default="").split()[-1],
            }
A CrawlSpider gets Scrapy's standard machinery for free: the built-in duplicate filter drops URLs it has already requested, DOWNLOAD_DELAY in your settings is respected, and failed requests are retried. For a site with straightforward pagination, it removes most of the boilerplate above.
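For reference, the politeness and retry behaviour mentioned here is controlled from the project's settings.py. The fragment below uses real Scrapy setting names, but the values are illustrative assumptions, not tuned recommendations for any particular site.

```python
# settings.py (fragment): values are illustrative, adjust per site

# Base delay in seconds between requests to the same site.
DOWNLOAD_DELAY = 1.0

# Wait a random 0.5x to 1.5x of DOWNLOAD_DELAY instead of a fixed interval.
RANDOMIZE_DOWNLOAD_DELAY = True

# Let Scrapy adapt the delay to the server's observed response times.
AUTOTHROTTLE_ENABLED = True

# How many times a failed request is retried before giving up.
RETRY_TIMES = 2
```

The same keys can be set per spider through a custom_settings dict on the spider class.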
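Two things to know. First, CrawlSpider applies its rules to every response it downloads, so a page that contains both items and a "next" link is handled correctly. Don't override parse() on a CrawlSpider or use it as a rule callback, though: the class has historically relied on that method to drive rule processing, and Scrapy's documentation warns against it. Use a separately named callback. Second, rule callbacks only run for responses reached through an extracted link. Responses for start_urls go to parse_start_url, which yields nothing by default, so if the first page contains items, have parse_start_url delegate to the same callback.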
Quick decision guide
| Situation | Approach |
| --- | --- |
| URL has /page/2 or ?page=2 | Increment, stop on 404 or empty |
| Page has a "Next" link in HTML | Follow href with urljoin, stop when absent |
| JSON API with has_next flag | Loop until flag is false |
| JSON API with cursor/token | Pass cursor back each request, stop when null |
| Building on Scrapy | CrawlSpider + LinkExtractor |