Scraping Paginated Sites Without Getting It Wrong

Pattern 1: Page number in the URL

The simplest pattern. The URL contains a page parameter — either as a query string (?page=2) or as part of the path (/catalogue/page-2.html). Increment it until you get a 404 or an empty result set.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://books.toscrape.com/catalogue/"

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
})

all_books = []
page = 1
while True:
    url = urljoin(BASE, f"page-{page}.html")
    resp = session.get(url, timeout=15)
    if resp.status_code == 404:
        break  # past the last page
    resp.raise_for_status()  # any other error status is a real failure

    resp.encoding = "utf-8"
    soup = BeautifulSoup(resp.text, "html.parser")
    books = soup.find_all("article", class_="product_pod")
    if not books:
        break  # empty page: also done

    for book in books:
        all_books.append({
            "title": book.find("h3").find("a")["title"],
            "price": book.find("p", class_="price_color").text.strip(),
            "rating": book.find("p", class_="star-rating")["class"][1],
        })
    print(f"Page {page}: {len(books)} books")
    page += 1

print(f"\nTotal: {len(all_books)} books")

Two termination conditions, not one. Some sites return an empty 200 for out-of-range pages rather than a 404. Check both.
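
The same two-condition loop can be pulled into a small generator, which makes it testable without touching the network. This is a minimal sketch rather than part of the script above: iter_numbered_pages, the fetch callable, and the max_pages safety cap are hypothetical names introduced here, and fetch(page) is assumed to return a (status_code, items) pair.

```python
from typing import Callable, Iterator, List, Tuple

# Hypothetical signature: fetch(page_number) -> (HTTP status code, parsed items)
Fetch = Callable[[int], Tuple[int, List[dict]]]


def iter_numbered_pages(fetch: Fetch, start: int = 1, max_pages: int = 1000) -> Iterator[List[dict]]:
    """Yield each page's items until a 404 or an empty page is seen."""
    for page in range(start, start + max_pages):
        status, items = fetch(page)
        if status == 404:
            return  # condition 1: the site says the page does not exist
        if not items:
            return  # condition 2: a 200 response with nothing on it
        yield items


# Stand-in for a real site: three pages of data, then empty 200 responses.
def fake_fetch(page: int) -> Tuple[int, List[dict]]:
    data = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}], 3: [{"id": 4}]}
    return 200, data.get(page, [])


pages = list(iter_numbered_pages(fake_fetch))
total_items = sum(len(p) for p in pages)
```

With a real site, fetch would wrap session.get plus the BeautifulSoup parsing. The max_pages cap is an extra guard for sites that never signal the end.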

Pattern 2: Following the "next" link

A cleaner approach for HTML-paginated sites: let the page tell you where to go next, rather than constructing URLs yourself. Most paginated sites include a "Next" link in the HTML. Follow it until it disappears.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://books.toscrape.com/catalogue/"

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
})

url = "https://books.toscrape.com/catalogue/page-1.html"
all_books = []
while url:
    resp = session.get(url, timeout=15)
    resp.raise_for_status()  # an error page has no "next" link and would end the loop silently
    resp.encoding = "utf-8"
    soup = BeautifulSoup(resp.text, "html.parser")
    for book in soup.find_all("article", class_="product_pod"):
        all_books.append({
            "title": book.find("h3").find("a")["title"],
            "price": book.find("p", class_="price_color").text.strip(),
            "rating": book.find("p", class_="star-rating")["class"][1],
        })
    # No "next" link means this was the last page
    next_btn = soup.select_one("li.next a")
    url = urljoin(BASE, next_btn["href"]) if next_btn else None

print(f"Scraped {len(all_books)} books")

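The urljoin(BASE, next_btn["href"]) call is worth noting. The href in a "next" link is often relative (page-2.html, ../page-2.html). urljoin resolves it the way a browser would, provided the base is the URL the link should be read against. Here BASE is the directory that holds every listing page; on sites where listing pages live at different depths, the safer base is the current page's URL (resp.url). Concatenating strings instead breaks on anything other than a bare filename, such as ../ segments, root-relative paths, or absolute URLs.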
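
To make the resolution rules concrete, here is a short sketch of how urljoin treats the common forms of href. The URLs are illustrative only; the page_url value is assumed to be the page the link was found on.

```python
from urllib.parse import urljoin

page_url = "https://books.toscrape.com/catalogue/page-1.html"

# A bare filename replaces the last path segment of the current page.
same_dir = urljoin(page_url, "page-2.html")

# "../" climbs one directory before resolving.
parent_dir = urljoin(page_url, "../index.html")

# A leading slash resets the path to the site root.
root_relative = urljoin(page_url, "/catalogue/page-2.html")

# An absolute URL is returned unchanged.
absolute = urljoin(page_url, "https://example.com/other.html")

# A directory base without its trailing slash is treated as a file,
# so its last segment is dropped.
missing_slash = urljoin("https://books.toscrape.com/catalogue", "page-2.html")

# Naive concatenation just glues the href onto the full page URL.
naive = page_url + "page-2.html"
```

The missing_slash case is the one that tends to bite: a base such as BASE only behaves like a directory if it ends in a slash.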

Pattern 3: API cursor / continuation token

JSON APIs often paginate differently. Instead of page numbers, they return a token or flag telling you whether more results exist, and sometimes a cursor to pass back on the next request.

The simplest version: a has_next boolean.

import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "application/json",
    "Accept-Language": "en-GB,en;q=0.9",
})

all_quotes = []
page = 1
while True:
    resp = session.get(
        "https://quotes.toscrape.com/api/quotes",
        params={"page": page},
        timeout=15,
    )
    resp.raise_for_status()
    data = resp.json()
    all_quotes.extend(data["quotes"])
    print(f"Page {page}: {len(data['quotes'])} quotes")
    if not data["has_next"]:
        break
    page += 1

print(f"\nTotal: {len(all_quotes)} quotes")

Some APIs use a cursor instead — the response includes a next_cursor or next_page_token field that you pass as a parameter on the subsequent request. The structure changes but the loop logic is the same: keep going until the cursor field is null or absent.

# Generic cursor pattern (reuses the session configured above)
items = []
params = {"limit": 100}
while True:
    resp = session.get("https://example.com/api/items", params=params, timeout=15)
    resp.raise_for_status()
    data = resp.json()
    items.extend(data["results"])
    cursor = data.get("next_cursor")
    if not cursor:
        break  # null or missing cursor: no more pages
    params["cursor"] = cursor
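
The cursor loop can also be written as a generator that takes the HTTP call as an argument, so the pagination logic can be exercised against canned responses. This is a sketch under assumptions: iter_cursor_items and fetch_page are hypothetical names, and the results and next_cursor field names simply mirror the generic example above rather than any particular API.

```python
from typing import Callable, Iterator, Optional


def iter_cursor_items(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield items from a cursor-paginated source until the cursor runs out."""
    cursor = None  # the first request is made without a cursor
    while True:
        data = fetch_page(cursor)
        yield from data.get("results", [])
        cursor = data.get("next_cursor")
        if not cursor:
            return  # null or absent cursor: that was the last page


# Canned responses standing in for a real API, keyed by the cursor sent.
FAKE_PAGES = {
    None: {"results": [{"id": 1}, {"id": 2}], "next_cursor": "abc"},
    "abc": {"results": [{"id": 3}], "next_cursor": "def"},
    "def": {"results": [{"id": 4}], "next_cursor": None},
}

items = list(iter_cursor_items(lambda cursor: FAKE_PAGES[cursor]))
ids = [item["id"] for item in items]
```

Against a live API, fetch_page would call session.get with the cursor added to params whenever it is not None, then return resp.json().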

Rate limiting

Sending requests as fast as the network allows is not scraping; it's a load test. Many sites rate-limit or block traffic that arrives faster than a human could generate it. A delay of 1 to 2 seconds between pages is a reasonable starting point; adjust it based on the site's response times and any explicit rate-limit signals it sends, such as a 429 status or a Retry-After header.

import time

while url:
    resp = session.get(url, timeout=15)
    # ... process page ...

    next_btn = soup.select_one("li.next a")
    url = urljoin(BASE, next_btn["href"]) if next_btn else None
    if url:
        time.sleep(1)  # only sleep if there's another request coming

Sleeping after the last page is unnecessary. Put the sleep before the next request or, as above, after confirming there is a next request.

For higher-volume work, time.sleep with a fixed value is blunt. A better approach draws a random delay from a range, for example time.sleep(random.uniform(0.5, 2.0)) with import random added at the top, which avoids the metronomic request timing that fixed delays produce.
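
One way to combine a jittered delay with a server-supplied hint is sketched below. The helper names retry_after_seconds and next_delay are hypothetical, and the sketch only handles the seconds form of the standard Retry-After header; the HTTP-date form is deliberately ignored.

```python
import random
from typing import Optional


def retry_after_seconds(header_value: Optional[str]) -> Optional[float]:
    """Parse a Retry-After header given as a whole number of seconds.

    Returns None when the header is missing or uses the HTTP-date form.
    """
    if header_value is None:
        return None
    value = header_value.strip()
    if value.isascii() and value.isdigit():
        return float(value)
    return None


def next_delay(retry_after: Optional[str] = None, low: float = 0.5, high: float = 2.0) -> float:
    """Pick the pause before the next request."""
    jitter = random.uniform(low, high)
    server_hint = retry_after_seconds(retry_after)
    # Never wait less than the server asked for.
    return jitter if server_hint is None else max(jitter, server_hint)


delay_plain = next_delay()                                  # jitter only
delay_hinted = next_delay("30")                             # server asked for 30 s
delay_date = next_delay("Wed, 21 Oct 2015 07:28:00 GMT")    # date form: falls back to jitter
```

In the loop above this would replace the fixed sleep with time.sleep(next_delay(resp.headers.get("Retry-After"))).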

Duplicate URL detection

Some sites have inconsistent pagination — "next" links that eventually loop back, or page parameters that wrap around. A simple seen_urls set catches this before it turns into an infinite loop:

from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
})

BASE = "https://books.toscrape.com/catalogue/"
url = "https://books.toscrape.com/catalogue/page-1.html"
seen_urls = set()
all_books = []
while url:
    if url in seen_urls:
        print(f"Loop detected at {url}, stopping")
        break
    seen_urls.add(url)

    resp = session.get(url, timeout=15)
    resp.encoding = "utf-8"
    soup = BeautifulSoup(resp.text, "html.parser")
    for book in soup.find_all("article", class_="product_pod"):
        all_books.append({
            "title": book.find("h3").find("a")["title"],
            "price": book.find("p", class_="price_color").text.strip(),
            "rating": book.find("p", class_="star-rating")["class"][1],
        })
    next_btn = soup.select_one("li.next a")
    url = urljoin(BASE, next_btn["href"]) if next_btn else None

print(f"Scraped {len(all_books)} books across {len(seen_urls)} pages")
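
A seen_urls set only catches exact string matches, so the same page reached through a differently written URL slips past it. One possible refinement, sketched here as an assumption rather than something the script above does, is to normalize each URL before the membership check. normalize_url is a hypothetical helper, and sorting query parameters assumes the site does not care about their order.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Return a canonical form so equivalent URLs compare equal."""
    parts = urlsplit(url)
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 match.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),   # host names are case-insensitive
        parts.path or "/",
        query,
        "",                     # drop the fragment: it is never sent to the server
    ))


seen = set()
for candidate in (
    "https://Example.com/list?page=2&sort=asc#top",
    "https://example.com/list?sort=asc&page=2",
):
    seen.add(normalize_url(candidate))
```

Both candidates collapse to one entry, so the loop check would treat the second as a repeat.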

Using Scrapy's CrawlSpider

If you're building on Scrapy, CrawlSpider handles link following automatically via rules. This is the idiomatic Scrapy approach for sites where pagination follows a consistent pattern:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BooksSpider(CrawlSpider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/catalogue/page-1.html"]

    rules = (
        # Follow "next page" links and parse every page they lead to
        Rule(
            LinkExtractor(restrict_css="li.next a"),
            callback="parse_page",
            follow=True,
        ),
    )

    def parse_start_url(self, response, **kwargs):
        # Rule callbacks only see followed links; the start URL arrives here,
        # so hand it to the same parser or page 1 is never scraped.
        return self.parse_page(response)

    def parse_page(self, response):
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(default="").strip(),
                "rating": book.css("p.star-rating::attr(class)").get(default="").split()[-1],
            }

Scrapy itself deduplicates requests through its built-in duplicate filter, applies DOWNLOAD_DELAY from your settings, and retries failed requests through its retry middleware; CrawlSpider adds the rule-based link following on top. For a site with straightforward pagination, that removes most of the boilerplate above.
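
As a rough starting point, the politeness settings mentioned here could look like the dictionary below. The setting names are real Scrapy settings, but the values are assumptions to tune per site, and CUSTOM_SETTINGS is just an illustrative variable name.

```python
# Candidate values only; tune them for the site being crawled.
CUSTOM_SETTINGS = {
    "DOWNLOAD_DELAY": 1.0,                  # base pause between requests, in seconds
    "RANDOMIZE_DOWNLOAD_DELAY": True,       # wait 0.5x to 1.5x DOWNLOAD_DELAY
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,    # one request at a time per site
    "RETRY_TIMES": 2,                       # retries after the first failed attempt
}
```

Assigning this dictionary to the custom_settings class attribute of a spider applies it to that spider only; the same keys can instead go in the project's settings.py.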

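Two things to know. First, responses for start_urls are delivered to parse_start_url, not to the rule callback, so the first page's items are skipped unless parse_start_url hands them to the same parser. Second, CrawlSpider applies its rules to each response it reaches by following a rule with follow=True, so a page that contains both items and a "next" link is handled correctly. The Scrapy documentation warns against using parse as a rule callback on a CrawlSpider, because the class relies on that method for its own rule processing. Use a separately named callback, as above.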

Quick decision guide

Situation	Approach
URL has `/page/2` or `?page=2`	Increment, stop on 404 or empty
Page has a "Next" link in HTML	Follow href with `urljoin`, stop when absent
JSON API with `has_next` flag	Loop until flag is false
JSON API with cursor/token	Pass cursor back each request, stop when null
Building on Scrapy	`CrawlSpider` + `LinkExtractor`
