
I would like to get some feedback on my code; the goal is to collect the addresses of all the branches (agencies) of a bank. I wrote a fairly simple brute-force scraper. Do you have any advice on improving the code, designing it differently (do I need an OOP approach here?), etc.?

```
import requests
import pandas as pd
from lxml import html

groupe = "credit-agricole"
dep = '21'
groupes = ["credit-agricole"]
deps = ['53', '44', '56', '35', '22', '49', '72', '29', '85']


def get_nb_pages(groupe, dep):
    """Return nb_pages ([int]): number of pages containing bank information.

    Args:
        groupe ([string]): bank groupe ("credit-agricole", ...)
        dep ([string]): departement ("01", ...)
    """
    url = "https://www.moneyvox.fr/pratique/agences/{groupe}/{dep}".format(groupe=groupe, dep=dep)
    req = requests.get(url)
    raw_html = req.text
    xpath = "/html/body/div[2]/article/div/div/div[3]/div[2]/nav/a"
    tree = html.fromstring(raw_html)
    nb_pages = len(tree.xpath(xpath)) + 1
    return nb_pages


def get_agencies(groupe, dep, page_num):
    """Return agencies ([List]): description of agencies scraped from the target page.

    Args:
        groupe ([string]): bank groupe ("credit-agricole", ...)
        dep ([string]): departement ("01", ...)
        page_num ([int]): target page
    """
    url = "https://www.moneyvox.fr/pratique/agences/{groupe}/{dep}/{page_num}".format(groupe=groupe, dep=dep, page_num=page_num)
    req = requests.get(url)
    raw_html = req.text
    xpath = '//div[@class="lh-bloc-agence like-text"]'
    tree = html.fromstring(raw_html)
    blocs_agencies = tree.xpath(xpath)
    agencies = []
    for bloc in blocs_agencies:
        agence = bloc.xpath("div/div[1]/h4")[0].text
        rue = bloc.xpath("div/div[1]/p[1]")[0].text
        code_postale = bloc.xpath("div/div[1]/p[2]")[0].text
        agencies.append((agence, rue, code_postale))
    return agencies


def get_all(groupes, deps):
    """Return all_agencies ([List]): description of agencies scraped.

    Args:
        groupes ([List]): target groups
        deps ([List]): target departments
    """
    all_agencies = []
    for groupe in groupes:
        for dep in deps:
            nb_pages = get_nb_pages(groupe, dep)
            for p in range(1, nb_pages + 1):
                agencies = get_agencies(groupe, dep, p)
                all_agencies.extend(agencies)
    df_agencies = pd.DataFrame(all_agencies, columns=['agence', 'rue', 'code_postale'])
    return df_agencies


get_nb_pages(groupe, dep)
get_agencies(groupe, dep, 1)
df_agencies = get_all(groupes, deps)
```
asked Apr 1, 2021 at 9:11

2 Answers

  • It's fine for your strings - and scraped web content - to be localised in French; but for consistency, keep your variable names in English (groupe -> group)
  • Prefer tuples over lists when you have immutable data
  • Add PEP 484 type hints where possible
  • Don't leave those first four variables in global scope; move them into a function
  • Consider using f-strings instead of format calls
  • Always check whether your requests calls failed; the easiest way is via raise_for_status
  • Tell requests when you're done with a response via context management
  • Use actual integers for your department numbers instead of stringly-typed data
  • Consider using an intermediate dataclass for your agency data instead of implicit tuples
  • Consider using generator functions (yield) to simplify your iterative code

First suggested rewrite:

```
from dataclasses import dataclass, astuple
from typing import Iterable, Collection

import pandas as pd
import requests
from lxml import html
from lxml.html import HtmlElement


@dataclass
class Agency:
    name: str
    street: str
    postal_code: str

    @classmethod
    def from_block(cls, block: HtmlElement) -> 'Agency':
        return cls(
            name=block.xpath("div/div[1]/h4")[0].text,
            street=block.xpath("div/div[1]/p[1]")[0].text,
            postal_code=block.xpath("div/div[1]/p[2]")[0].text,
        )


def get_nb_pages(group: str, department: int) -> int:
    """Return nb_pages ([int]): number of pages containing bank information.

    Args:
        group ([string]): bank group ("credit-agricole", ...)
        department ([int]): departement (1, ...)
    """
    url = f"https://www.moneyvox.fr/pratique/agences/{group}/{department}"
    with requests.get(url) as req:
        req.raise_for_status()
        raw_html = req.text
    xpath = "/html/body/div[2]/article/div/div/div[3]/div[2]/nav/a"
    tree = html.fromstring(raw_html)
    return len(tree.xpath(xpath)) + 1


def get_agencies(group: str, department: int, page_num: int) -> Iterable[Agency]:
    """Yield Agency instances scraped from the target page.

    Args:
        group ([string]): bank group ("credit-agricole", ...)
        department ([int]): departement (1, ...)
        page_num ([int]): target page
    """
    url = f"https://www.moneyvox.fr/pratique/agences/{group}/{department}/{page_num}"
    with requests.get(url) as req:
        req.raise_for_status()
        raw_html = req.text
    xpath = '//div[@class="lh-bloc-agence like-text"]'
    tree = html.fromstring(raw_html)
    for block in tree.xpath(xpath):
        yield Agency.from_block(block)


def get_all(groups: Iterable[str], departments: Collection[int]):
    """Yield all scraped agencies.

    Args:
        groups ([Iterable]): target groups
        departments ([Collection]): target departments
    """
    for group in groups:
        for department in departments:
            nb_pages = get_nb_pages(group, department)
            for page in range(1, nb_pages + 1):
                yield from get_agencies(group, department, page)


def main():
    group = "credit-agricole"
    department = 21
    groups = ("credit-agricole",)
    departments = (53, 44,)  # ... 56, 35, 22, 49, 72, 29, 85)

    n_pages = get_nb_pages(group, department)
    agencies = tuple(get_agencies(group, department, page_num=1))
    all_agencies = get_all(groups, departments)
    df_agencies = pd.DataFrame(
        (astuple(agency) for agency in all_agencies),
        columns=('agence', 'rue', 'code_postale'),
    )


if __name__ == '__main__':
    main()
```

All of that being the case, your approach using xpath selectors is very fragile. Here is an alternate approach that uses named elements with classes and IDs where available. It is incomplete because I think the site rate-limited my IP, which is of course a direct risk of scraping and totally within the rights of the website.

BeautifulSoup alternative:

```
import re
from dataclasses import dataclass, astuple
from typing import Iterable, ClassVar, Pattern

import pandas as pd
from bs4 import BeautifulSoup, Tag
from requests import Session

ROOT = 'https://www.moneyvox.fr'


@dataclass
class Branch:
    name: str
    street: str
    city: str
    postal_code: str
    path: str

    @classmethod
    def scrape_all(cls, session: Session, path: str) -> Iterable['Branch']:
        page = ''
        while True:
            with session.get(ROOT + path + page) as response:
                response.raise_for_status()
                doc = BeautifulSoup(response.text, 'xml')
            body = doc.select_one('div.main-body')
            city = None
            for head_or_cell in body.select('h2, div.lh-bloc-agence'):
                if head_or_cell.name == 'h2':
                    city = head_or_cell.text
                elif head_or_cell.name == 'div':
                    street, postal_code = head_or_cell.select('p')
                    yield cls(
                        name=head_or_cell.h4.text,
                        street=street.text,
                        city=city,
                        postal_code=postal_code.text,
                        path=head_or_cell.select_one('a.lh-btn-info')['href'],
                    )
            # perform depagination here
            break


@dataclass
class Department:
    name: str
    code: str
    path: str
    n_branches: int
    re_count: ClassVar[Pattern] = re.compile(r'\d+')

    @classmethod
    def from_li(cls, li: Tag) -> 'Department':
        return cls(
            name=li.strong.text,
            path=li.a['href'],
            code=cls.re_count.search(li.a.text)[0],
            n_branches=int(cls.re_count.search(li.em.text)[0]),
        )


@dataclass
class Agency:
    name: str
    category: str
    path: str

    @classmethod
    def scrape_all(cls, session: Session) -> Iterable['Agency']:
        with session.get(ROOT + '/pratique/agences') as response:
            response.raise_for_status()
            doc = BeautifulSoup(response.text, 'xml')
        body = doc.select_one('div.main-body')
        category = None
        for head_or_cell in body.select('h2, a.lh-lien-bloc-liste'):
            if head_or_cell.name == 'h2':
                category = head_or_cell.text
            elif head_or_cell.name == 'a':
                yield cls(
                    name=head_or_cell.text,
                    category=category,
                    path=head_or_cell['href'],
                )

    def get_departments(self, session: Session) -> Iterable[Department]:
        with session.get(ROOT + self.path) as response:
            response.raise_for_status()
            doc = BeautifulSoup(response.text, 'xml')
        for li in doc.select('#tabs-departement li'):
            yield Department.from_li(li)

    def __str__(self):
        return self.name


def main():
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', None)

    with Session() as session:
        agencies = {a.name: a for a in Agency.scrape_all(session)}
        agency_df = pd.DataFrame(
            (astuple(a) for a in agencies.values()),
            columns=('Nom', 'Catégorie', 'Lien'),
        )
        print(agency_df)

        agency = agencies['Crédit Agricole']
        departments = {d.name: d for d in agency.get_departments(session)}
        department = departments['Ardennes']
        branches = {b.name: b for b in Branch.scrape_all(session, department.path)}


if __name__ == '__main__':
    main()
```
Reinderien — answered Apr 2, 2021 at 4:18

To emphasize what Reinderien already said: always check the result of your requests calls. The status_code should be 200; if you get anything else, stop and investigate. It is possible that the website is blocking you, and there is no point running blind.
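For illustration, a minimal sketch of such a check (the URL here is just a placeholder):

```
import requests

response = requests.get("https://www.moneyvox.fr/pratique/agences")  # placeholder URL
if response.status_code != 200:
    # Stop and investigate rather than scraping blindly
    raise RuntimeError(f"Unexpected status {response.status_code}")

# Equivalently, let requests raise an HTTPError for any 4xx/5xx response:
response.raise_for_status()
```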

Also, I recommend that you spoof the user agent; otherwise it is obvious to the web server that you are running a bot, and it may block you or apply more stringent rate limiting than a casual user would experience. By default the user agent is something like python-requests/2.25.1.

And since you are making repeated calls, you should use a session instead. Reinderien already refactored your code with a session but did not mention this point explicitly. The benefits are persistence (of cookies, for instance) and more efficient connection pooling at the TCP level. You can also use the session to set default headers for all your requests.

Example:

```
>>> session = requests.session()
>>> session.headers
{'User-Agent': 'python-requests/2.25.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
>>>
```

Change the user agent for the session; here we spoof Firefox:

```
session.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0'})
```

And then you can use session.get to retrieve pages.

You might also be interested in prepared requests. I strongly recommend that Python developers get acquainted with that section of the requests documentation.
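As a rough illustration of the idea (the URL is again a placeholder), a prepared request lets you build and inspect the exact request the session will send before sending it:

```
from requests import Request, Session

session = Session()

# Build the request up front; prepare_request merges in the session's
# headers and cookies, so you can inspect or tweak the final request.
request = Request('GET', 'https://www.moneyvox.fr/pratique/agences')  # placeholder URL
prepared = session.prepare_request(request)
print(prepared.headers)

response = session.send(prepared, timeout=10)
response.raise_for_status()
```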

To speed up the process you could also add parallel processing, for example using threads. But be gentle: too many simultaneous open connections is another thing that can get you blocked. Usage patterns have to remain reasonable and human-like.
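A minimal sketch of what that could look like, assuming the session configured above and a hypothetical fetch_page helper; the department URLs are placeholders, and the small worker pool is deliberate:

```
from concurrent.futures import ThreadPoolExecutor
import requests

session = requests.Session()  # or reuse the session configured above


def fetch_page(url: str) -> str:
    # Hypothetical helper: download one page, failing fast on bad statuses.
    with session.get(url, timeout=10) as response:
        response.raise_for_status()
        return response.text


urls = [
    f"https://www.moneyvox.fr/pratique/agences/credit-agricole/{dep}"
    for dep in (53, 44, 56)  # placeholder departments
]

# A small pool keeps the number of simultaneous connections low and polite.
with ThreadPoolExecutor(max_workers=2) as executor:
    pages = list(executor.map(fetch_page, urls))
```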

answered Apr 2, 2021 at 21:15
