I looked back on an old project of mine and decided to work on it again, while abstaining from using Selenium. I was able to do so quite successfully, using `requests` and `bs4`. However, manually handling each of the requests to the ASP.NET website made me dislike the complexity of my code, as I wasn't using any web interaction libraries.
I am looking for feedback on the `SimpleBrowser` tool, as well as the `leapy` program. `SimpleBrowser` has been quite purpose-built to make `leapy` simpler, so I am looking to make it more versatile and universal. I am curious about what could be done differently. I'll gladly listen to any nitpicks, as I still consider myself a noob. :P
browser.py
import requests
from bs4 import BeautifulSoup, SoupStrainer


class BrowserError(Exception):
    pass


class ParsingError(BrowserError):
    pass


class NoWebsiteLoadedError(BrowserError):
    pass


class SimpleBrowser:
    """Low-level HTTP browser to simplify interacting with websites.

    Attributes:
        parser: Used in website parsing, defaults to `lxml`.
        session: A reusable TCP connection, useful for making requests to the
            same website and managing cookies.
            <http://docs.python-requests.org/en/master/user/advanced/#session-objects>
        url: Full URL of currently loaded website.
        response: Response of currently loaded website.
    """

    def __init__(self, parser='lxml'):
        self.parser = parser
        self.session = requests.Session()
        self._url = None
        self._response = None

    @property
    def url(self):
        """Return the URL of currently loaded website."""
        return self._url

    @property
    def response(self):
        """Return the `Response` object of currently loaded website."""
        return self._response

    @property
    def cookies(self):
        """Return the CookieJar instance of the current `Session`."""
        return self.session.cookies

    def soup(self, *args, **kwargs):
        """Parse the currently loaded website.

        Optionally, SoupStrainer can be used to only parse relevant
        parts of the page. This can be particularly useful if the website is
        complex or performance is a factor.
        <https://www.crummy.com/software/BeautifulSoup/bs4/doc/#soupstrainer>

        Args:
            *args: Optional positional arguments that `SoupStrainer` takes.
            **kwargs: Optional keyword argument that `SoupStrainer` takes.

        Returns:
            A `BeautifulSoup` object.

        Raises:
            NoWebsiteLoadedError: If no website is currently loaded.
            ParsingError: If the current response isn't supported by `bs4`.
        """
        if self._url is None:
            raise NoWebsiteLoadedError('website parsing requires a loaded website')
        content_type = self._response.headers.get('Content-Type', '')
        if not any(markup in content_type for markup in ('html', 'xml')):
            raise ParsingError('unsupported content type \'{}\''.format(content_type))
        strainer = SoupStrainer(*args, **kwargs)
        return BeautifulSoup(self._response.content, self.parser, parse_only=strainer)

    def get(self, url, **kwargs):
        """Send a GET request to the specified URL.

        Method directly wraps around `Session.get` and updates browser
        attributes.
        <http://docs.python-requests.org/en/master/api/#requests.get>

        Args:
            url: URL for the new `Request` object.
            **kwargs: Optional arguments that `Request` takes.

        Returns:
            `Response` object of a successful request.
        """
        response = self.session.get(url, **kwargs)
        self._url = response.url
        self._response = response
        return response

    def post(self, **kwargs):
        """Send a POST request to the currently loaded website's URL.

        The browser will automatically fill out the form. If `data` dict has
        been passed into ``kwargs``, the contained input values will override
        the automatically filled out values.

        Returns:
            `Response` object of a successful request.

        Raises:
            NoWebsiteLoadedError: If no website is currently loaded.
        """
        if self._url is None:
            raise NoWebsiteLoadedError('request submission requires a loaded website')
        data = kwargs.get('data', {})
        for i in self.soup('form').select('input[name]'):
            if i.get('name') not in data:
                data[i.get('name')] = i.get('value', '')
        kwargs['data'] = data
        response = self.session.post(self._url, **kwargs)
        self._url = response.url
        self._response = response
        return response
leapy.py
import re

from browser import SimpleBrowser


class LeapError(Exception):
    pass


class LoginError(LeapError):
    pass


class Leap:
    """Interface class for automated access to the Leapcard website.

    Attributes:
        browser: An instance of `SimpleBrowser`
    """
    BASE_URL = 'https://www.leapcard.ie/en/'
    LOGIN_URL = BASE_URL + 'login.aspx'
    TABLE_URL = BASE_URL + 'SelfServices/CardServices/ViewJourneyHistory.aspx'

    def __init__(self):
        self.browser = SimpleBrowser()

    @property
    def login_cookie(self):
        """Return True if user authentication is successful."""
        return any('ASPXFORMSAUTH' in c.name for c in self.browser.cookies)

    def login(self, username, password):
        """Authenticate a user account to access user information.

        Args:
            username: Leapcard.ie account username
            password: Leapcard.ie account password

        Raises:
            LoginError: If user authentication fails.
        """
        self.browser.get(self.LOGIN_URL)
        data = {
            'ctl00$ContentPlaceHolder1$UserName': username,
            'ctl00$ContentPlaceHolder1$Password': password,
            'ctl00$ContentPlaceHolder1$btnlogin': 'Login'
        }
        self.browser.post(data=data)
        if self.login_cookie is False:
            raise LoginError('user login failure')

    def select_card(self, card_number):
        """Select the requested card number from the dropdown menu.

        In case of an account with multiple cards registered, this method
        will ensure that the correct card has been selected.

        Args:
            card_number: Unique Leap card number

        Raises:
            LeapError: If requested card is not registered in user account.
        """
        cards = self.browser.soup().select_one('select[id*=CardsList]')
        registered_cards = {c.text.split()[0]: c.get('value') for c in cards.select('option[value]')}
        if card_number not in registered_cards:
            raise LeapError('requested card not registered: {}'.format(card_number))
        data = {cards.get('name'): registered_cards.get(card_number)}
        self.browser.post(data=data)

    @property
    def balance(self):
        """Fetch dictionary with last known travel credit balance.

        Returns:
            A dictionary containing date and time of the last transaction
            made with a Leap card and the balance after the transaction.
        """
        self.browser.get(self.TABLE_URL)
        table = self.browser.soup().select_one('table[id*=CardJourney]')
        date = table.find_next(text=re.compile(r'\d{2}/\d{2}/\d{4}'))
        time = table.find_next(text=re.compile(r'\d{1,2}:\d{2} \wM'))
        balance = table.find_next(text=re.compile(r'€-?\d{1,3}\.\d{2}')).next_element.text.strip('€')
        return {'date': date, 'time': time, 'balance': balance}
Example table of the last 2 transactions:
<table class="table" cellspacing="0" cellpadding="3" rules="all" align="left" rules="none" id="gvCardJourney" style="border-width:1px;border-style:solid;width:100%;border-collapse:collapse;">
<caption>
Travel Credit History Information
</caption><tr class="grid-header" align="left" style="color:White;background-color:#008033;">
<th scope="col" abbr="Date">Date</th><th scope="col" abbr="Time">Time</th><th scope="col" abbr="ParticipantShortNameDescription">Source</th><th scope="col" abbr="TransactionTypeDescription">Transaction Type</th><th scope="col" abbr="TransactionAmountEuro">Amount</th><th scope="col" abbr="PurseAmountEuro">Balance</th>
</tr><tr style="background-color:#EDEDED;">
<td align="center">24/11/2017</td><td align="center" style="white-space:nowrap;">12:41 PM</td><td align="center">Luas</td><td align="center">Travel Credit Returned</td><td align="center">€2.13</td><td align="center">€6.49</td>
</tr><tr style="background-color:#F2F1F1;">
<td align="center">24/11/2017</td><td align="center" style="white-space:nowrap;">12:31 PM</td><td align="center">Luas</td><td align="center">Travel Credit Deduction</td><td align="center">€-2.13</td><td align="center">€4.36</td>
</tr>
1 Answer
The code is pretty well-documented and understandable, great job!
Just a few thoughts, nitpicks and ideas:
- move exceptions definitions to a separate module/file, `exceptions.py`?
- and, instead of having `Exception` as a base class for your exceptions, consider introducing your own base exception class - like, for example, the `RequestException` in the `requests` library
- note that you may specify docstrings for exception classes instead of `pass` keywords - win-win, an opportunity to document an exception and follow the language rules
- instead of a string concatenation, use `urljoin()` for URL joining
- think of more descriptive and explicit variable names than `i` or `c`
- I would probably explicitly specify `A` or `P` for the regular expression to match the time: `\d{1,2}:\d{2} [AP]M`
- I am not completely confident in the regular expression for the balance. It would not, for instance, match the `€1000.00` balance value because of the `\d{1,3}` requirement. Also, the regular expression assumes that the decimal part will always be present - recheck if this is always true for that page
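A minimal sketch of the exception and `urljoin()` suggestions, using the class names and URLs from the question. The docstring wording is my own, and the constants are shown at module level only to keep the sketch short; in the question they are class attributes of `Leap`.

```python
from urllib.parse import urljoin


class LeapError(Exception):
    """Base class for all errors raised by the Leap interface."""


class LoginError(LeapError):
    """Raised when user authentication fails."""


# urljoin() resolves the second argument relative to the first.
# The trailing slash on BASE_URL matters: without it, the last path
# segment ('en') would be replaced instead of extended.
BASE_URL = 'https://www.leapcard.ie/en/'
LOGIN_URL = urljoin(BASE_URL, 'login.aspx')
TABLE_URL = urljoin(BASE_URL, 'SelfServices/CardServices/ViewJourneyHistory.aspx')

print(LOGIN_URL)
print(TABLE_URL)
```

A docstring is a valid class body on its own, so the `pass` statement is no longer needed.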
Thanks Alexander! I'm still new to regex, but the maximum possible balance on a Leap card is €150.00, with the decimal part always present. This way, I figured I would match between 1 and 3 digits before the decimal point and exactly 2 after. – Luke, Nov 27, 2017 at 18:41
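To make the regex discussion concrete, here is a small check of the patterns against sample values. `LOOSE_BALANCE_RE` and the `matches` helper are hypothetical, shown only for comparison and not taken from the question or the answer. The sketch uses `fullmatch` for strictness, whereas a compiled pattern passed to `find_next` is applied as a search.

```python
import re

# Time pattern from the question, and the stricter one from the answer.
ORIGINAL_TIME_RE = re.compile(r'\d{1,2}:\d{2} \wM')
STRICT_TIME_RE = re.compile(r'\d{1,2}:\d{2} [AP]M')

# Balance pattern from the question: 1 to 3 digits, then exactly 2 decimals.
BALANCE_RE = re.compile(r'€-?\d{1,3}\.\d{2}')

# Hypothetical looser variant: any number of digits, optional decimal part.
LOOSE_BALANCE_RE = re.compile(r'€-?\d+(?:\.\d{2})?')


def matches(pattern, text):
    """Return True if the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None


for value in ('€6.49', '€-2.13', '€150.00', '€1000.00'):
    print(value, matches(BALANCE_RE, value), matches(LOOSE_BALANCE_RE, value))
```

With a balance cap of €150.00 and the decimal part always present, the question's `\d{1,3}` pattern covers every value the page can show; the looser variant only matters if those assumptions stop holding.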