
I would like to get some feedback on my code; the goal is to collect the addresses of all the branches (agencies) of a bank. I wrote a fairly simple brute-force scraper. Do you have any advice on improving the code, designing it differently (do I need an OOP approach here?), etc.?

```
import requests
import pandas as pd
from lxml import html

groupe = "credit-agricole"
dep = '21'
groupes = ["credit-agricole"]
deps = ['53', '44', '56', '35', '22', '49', '72', '29', '85']


def get_nb_pages(groupe, dep):
    """Return nb_pages ([int]): number of pages containing bank information.

    Args:
        groupe ([string]): bank groupe ("credit-agricole", ...)
        dep ([string]): departement ("01", ...)
    """
    url = "https://www.moneyvox.fr/pratique/agences/{groupe}/{dep}".format(groupe=groupe, dep=dep)
    req = requests.get(url)
    raw_html = req.text
    xpath = "/html/body/div[2]/article/div/div/div[3]/div[2]/nav/a"
    tree = html.fromstring(raw_html)
    nb_pages = len(tree.xpath(xpath)) + 1
    return nb_pages


def get_agencies(groupe, dep, page_num):
    """Return agencies ([List]): description of agencies scraped from the target page.

    Args:
        groupe ([string]): bank groupe ("credit-agricole", ...)
        dep ([string]): departement ("01", ...)
        page_num ([int]): target page
    """
    url = "https://www.moneyvox.fr/pratique/agences/{groupe}/{dep}/{page_num}".format(groupe=groupe, dep=dep, page_num=page_num)
    req = requests.get(url)
    raw_html = req.text
    xpath = '//div[@class="lh-bloc-agence like-text"]'
    tree = html.fromstring(raw_html)
    blocs_agencies = tree.xpath(xpath)
    agencies = []
    for bloc in blocs_agencies:
        agence = bloc.xpath("div/div[1]/h4")[0].text
        rue = bloc.xpath("div/div[1]/p[1]")[0].text
        code_postale = bloc.xpath("div/div[1]/p[2]")[0].text
        agencies.append((agence, rue, code_postale))
    return agencies


def get_all(groupes, deps):
    """Return all_agencies ([List]): description of agencies scraped.

    Args:
        groupes ([List]): target groups
        deps ([List]): target departments
    """
    all_agencies = []
    for groupe in groupes:
        for dep in deps:
            nb_pages = get_nb_pages(groupe, dep)
            for p in range(1, nb_pages + 1):
                agencies = get_agencies(groupe, dep, p)
                all_agencies.extend(agencies)
    df_agencies = pd.DataFrame(all_agencies, columns=['agence', 'rue', 'code_postale'])
    return df_agencies


get_nb_pages(groupe, dep)
get_agencies(groupe, dep, 1)
df_agencies = get_all(groupes, deps)
```
asked Apr 1, 2021 at 9:11

2 Answers

  • It's fine for your strings - and scraped web content - to be localised in French; but for consistency, keep your variable names in English (groupe -> group)
  • Prefer tuples over lists when you have immutable data
  • Add PEP 484 type hints where possible
  • Don't leave those first four variables in global scope; move them into a function
  • Consider using f-strings instead of format calls
  • Always check whether your requests calls failed; the easiest way is via raise_for_status
  • Tell requests when you're done with a response via context management
  • Use actual integers for your department numbers instead of stringly-typed data
  • Consider using an intermediate dataclass for your agency data instead of implicit tuples
  • Consider using generator functions (yield) to simplify your iterative code

First suggested rewrite:

```
from dataclasses import dataclass, astuple
from typing import Iterable, Collection

import pandas as pd
import requests
from lxml import html
from lxml.html import HtmlElement


@dataclass
class Agency:
    name: str
    street: str
    postal_code: str

    @classmethod
    def from_block(cls, block: HtmlElement) -> 'Agency':
        return cls(
            name=block.xpath("div/div[1]/h4")[0].text,
            street=block.xpath("div/div[1]/p[1]")[0].text,
            postal_code=block.xpath("div/div[1]/p[2]")[0].text,
        )


def get_nb_pages(group: str, department: int) -> int:
    """Return nb_pages ([int]): number of pages containing bank information.

    Args:
        group ([string]): bank group ("credit-agricole", ...)
        department ([int]): departement (1, ...)
    """
    url = f"https://www.moneyvox.fr/pratique/agences/{group}/{department}"
    with requests.get(url) as req:
        req.raise_for_status()
        raw_html = req.text
    xpath = "/html/body/div[2]/article/div/div/div[3]/div[2]/nav/a"
    tree = html.fromstring(raw_html)
    return len(tree.xpath(xpath)) + 1


def get_agencies(group: str, department: int, page_num: int) -> Iterable[Agency]:
    """Yield Agency instances scraped from the target page.

    Args:
        group ([string]): bank group ("credit-agricole", ...)
        department ([int]): departement (1, ...)
        page_num ([int]): target page
    """
    url = f"https://www.moneyvox.fr/pratique/agences/{group}/{department}/{page_num}"
    with requests.get(url) as req:
        req.raise_for_status()
        raw_html = req.text
    xpath = '//div[@class="lh-bloc-agence like-text"]'
    tree = html.fromstring(raw_html)
    for block in tree.xpath(xpath):
        yield Agency.from_block(block)


def get_all(groups: Iterable[str], departments: Collection[int]):
    """Yield all scraped agencies.

    Args:
        groups ([Iterable]): target groups
        departments ([Collection]): target departments
    """
    for group in groups:
        for department in departments:
            nb_pages = get_nb_pages(group, department)
            for page in range(1, nb_pages + 1):
                yield from get_agencies(group, department, page)


def main():
    group = "credit-agricole"
    department = 21
    groups = ("credit-agricole",)
    departments = (53, 44,)  # ... 56, 35, 22, 49, 72, 29, 85)

    n_pages = get_nb_pages(group, department)
    agencies = tuple(get_agencies(group, department, page_num=1))
    all_agencies = get_all(groups, departments)
    df_agencies = pd.DataFrame(
        (astuple(agency) for agency in all_agencies),
        columns=('agence', 'rue', 'code_postale'),
    )


if __name__ == '__main__':
    main()
```

All of that being the case, your approach using xpath selectors is very fragile. Here is an alternate approach that uses named elements with classes and IDs where available. It is incomplete because I think the site rate-limited my IP, which is of course a direct risk of scraping and totally within the rights of the website.

BeautifulSoup alternative:

```
import re
from dataclasses import dataclass, astuple
from typing import Iterable, ClassVar, Pattern

import pandas as pd
from bs4 import BeautifulSoup, Tag
from requests import Session

ROOT = 'https://www.moneyvox.fr'


@dataclass
class Branch:
    name: str
    street: str
    city: str
    postal_code: str
    path: str

    @classmethod
    def scrape_all(cls, session: Session, path: str) -> Iterable['Branch']:
        page = ''
        while True:
            with session.get(ROOT + path + page) as response:
                response.raise_for_status()
                doc = BeautifulSoup(response.text, 'xml')
            body = doc.select_one('div.main-body')
            city = None
            for head_or_cell in body.select('h2, div.lh-bloc-agence'):
                if head_or_cell.name == 'h2':
                    city = head_or_cell.text
                elif head_or_cell.name == 'div':
                    street, postal_code = head_or_cell.select('p')
                    yield cls(
                        name=head_or_cell.h4.text,
                        street=street.text,
                        city=city,
                        postal_code=postal_code.text,
                        path=head_or_cell.select_one('a.lh-btn-info')['href'],
                    )
            # perform depagination here
            break


@dataclass
class Department:
    name: str
    code: str
    path: str
    n_branches: int
    re_count: ClassVar[Pattern] = re.compile(r'\d+')

    @classmethod
    def from_li(cls, li: Tag) -> 'Department':
        return cls(
            name=li.strong.text,
            path=li.a['href'],
            code=cls.re_count.search(li.a.text)[0],
            n_branches=int(cls.re_count.search(li.em.text)[0]),
        )


@dataclass
class Agency:
    name: str
    category: str
    path: str

    @classmethod
    def scrape_all(cls, session: Session) -> Iterable['Agency']:
        with session.get(ROOT + '/pratique/agences') as response:
            response.raise_for_status()
            doc = BeautifulSoup(response.text, 'xml')
        body = doc.select_one('div.main-body')
        category = None
        for head_or_cell in body.select('h2, a.lh-lien-bloc-liste'):
            if head_or_cell.name == 'h2':
                category = head_or_cell.text
            elif head_or_cell.name == 'a':
                yield cls(
                    name=head_or_cell.text,
                    category=category,
                    path=head_or_cell['href'],
                )

    def get_departments(self, session: Session) -> Iterable[Department]:
        with session.get(ROOT + self.path) as response:
            response.raise_for_status()
            doc = BeautifulSoup(response.text, 'xml')
        for li in doc.select('#tabs-departement li'):
            yield Department.from_li(li)

    def __str__(self):
        return self.name


def main():
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', None)

    with Session() as session:
        agencies = {a.name: a for a in Agency.scrape_all(session)}
        agency_df = pd.DataFrame(
            (astuple(a) for a in agencies.values()),
            columns=('Nom', 'Catégorie', 'Lien'),
        )
        print(agency_df)

        agency = agencies['Crédit Agricole']
        departments = {d.name: d for d in agency.get_departments(session)}
        department = departments['Ardennes']
        branches = {b.name: b for b in Branch.scrape_all(session, department.path)}


if __name__ == '__main__':
    main()
```
Reinderien — answered Apr 2, 2021 at 4:18

To emphasize what Reinderien already said: always check the result of your requests calls. The status_code should be 200; if you get anything else, stop and investigate. It is possible that the website is blocking you, and there is no point running blind.
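For illustration, a minimal sketch of such a check (the URL here is just a placeholder):

```
import requests

response = requests.get("https://www.moneyvox.fr/pratique/agences")  # placeholder URL
if response.status_code != 200:
    # Stop and investigate rather than scraping blindly
    raise RuntimeError(f"Unexpected status {response.status_code}")

# Equivalently, let requests raise an HTTPError for any 4xx/5xx response:
response.raise_for_status()
```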

Also, I recommend that you spoof the user agent; otherwise it is obvious to the web server that you are running a bot, and it may block you or apply more stringent rate limiting than a casual user would experience. By default the user agent is something like python-requests/2.25.1.

And since you are making repeated calls, you should use a session instead. Reinderien already refactored your code with a session but did not mention this point explicitly. The benefits are persistence (of cookies, for instance) and more efficient connection pooling at the TCP level. You can also use the session to set default headers for all your requests.

Example:

```
>>> session = requests.session()
>>> session.headers
{'User-Agent': 'python-requests/2.25.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
>>>
```

Change the user agent for the session; here we spoof Firefox:

```
session.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0'})
```

And then you can use session.get to retrieve pages.

You might also be interested in prepared requests. I strongly recommend that Python developers get acquainted with that section of the requests documentation.
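As a rough illustration of the idea (the URL is again a placeholder), a prepared request lets you build and inspect the exact request the session will send before sending it:

```
from requests import Request, Session

session = Session()

# Build the request up front; prepare_request merges in the session's
# headers and cookies, so you can inspect or tweak the final request.
request = Request('GET', 'https://www.moneyvox.fr/pratique/agences')  # placeholder URL
prepared = session.prepare_request(request)
print(prepared.headers)

response = session.send(prepared, timeout=10)
response.raise_for_status()
```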

To speed up the process you could also add parallel processing, for example using threads. But be gentle: too many simultaneous open connections is another thing that can get you blocked. Usage patterns have to remain reasonable and human-like.
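A minimal sketch of what that could look like, assuming the session configured above and a hypothetical fetch_page helper; the department URLs are placeholders, and the small worker pool is deliberate:

```
from concurrent.futures import ThreadPoolExecutor
import requests

session = requests.Session()  # or reuse the session configured above


def fetch_page(url: str) -> str:
    # Hypothetical helper: download one page, failing fast on bad statuses.
    with session.get(url, timeout=10) as response:
        response.raise_for_status()
        return response.text


urls = [
    f"https://www.moneyvox.fr/pratique/agences/credit-agricole/{dep}"
    for dep in (53, 44, 56)  # placeholder departments
]

# A small pool keeps the number of simultaneous connections low and polite.
with ThreadPoolExecutor(max_workers=2) as executor:
    pages = list(executor.map(fetch_page, urls))
```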

answered Apr 2, 2021 at 21:15
