Simple, low-level browser for easier website interaction

Question 1

I looked back on an old project of mine and decided to work on it again, this time without Selenium. That went quite well using requests and bs4. However, because I wasn't using any web interaction library, I had to handle each request to the ASP.NET website by hand, and I disliked how complex that made my code.

I am looking for feedback on the SimpleBrowser tool as well as the leapy program. SimpleBrowser was purpose-built to make leapy simpler, so I would like to make it more versatile and universal. I am curious about what could be done differently and will gladly listen to any nitpicks, as I still consider myself a noob. :P

browser.py

import requests
from bs4 import BeautifulSoup, SoupStrainer


class BrowserError(Exception):
    pass


class ParsingError(BrowserError):
    pass


class NoWebsiteLoadedError(BrowserError):
    pass


class SimpleBrowser:
    """Low-level HTTP browser to simplify interacting with websites.

    Attributes:
        parser: Used in website parsing, defaults to `lxml`.
        session: A reusable TCP connection, useful for making requests to the
            same website and managing cookies.
            <http://docs.python-requests.org/en/master/user/advanced/#session-objects>
        url: Full URL of currently loaded website.
        response: Response of currently loaded website.
    """

    def __init__(self, parser='lxml'):
        self.parser = parser
        self.session = requests.Session()
        self._url = None
        self._response = None

    @property
    def url(self):
        """Return the URL of currently loaded website."""
        return self._url

    @property
    def response(self):
        """Return the `Response` object of currently loaded website."""
        return self._response

    @property
    def cookies(self):
        """Return the CookieJar instance of the current `Session`."""
        return self.session.cookies

    def soup(self, *args, **kwargs):
        """Parse the currently loaded website.

        Optionally, SoupStrainer can be used to only parse relevant
        parts of the page. This can be particularly useful if the website is
        complex or performance is a factor.
        <https://www.crummy.com/software/BeautifulSoup/bs4/doc/#soupstrainer>

        Args:
            *args: Optional positional arguments that `SoupStrainer` takes.
            **kwargs: Optional keyword argument that `SoupStrainer` takes.

        Returns:
            A `BeautifulSoup` object.

        Raises:
            NoWebsiteLoadedError: If no website is currently loaded.
            ParsingError: If the current response isn't supported by `bs4`
        """
        if self._url is None:
            raise NoWebsiteLoadedError('website parsing requires a loaded website')
        content_type = self._response.headers.get('Content-Type', '')
        if not any(markup in content_type for markup in ('html', 'xml')):
            raise ParsingError('unsupported content type \'{}\''.format(content_type))
        strainer = SoupStrainer(*args, **kwargs)
        return BeautifulSoup(self._response.content, self.parser, parse_only=strainer)

    def get(self, url, **kwargs):
        """Send a GET request to the specified URL.

        Method directly wraps around `Session.get` and updates browser
        attributes.
        <http://docs.python-requests.org/en/master/api/#requests.get>

        Args:
            url: URL for the new `Request` object.
            **kwargs: Optional arguments that `Request` takes.

        Returns:
            `Response` object of a successful request.
        """
        response = self.session.get(url, **kwargs)
        self._url = response.url
        self._response = response
        return response

    def post(self, **kwargs):
        """Send a POST request to the currently loaded website's URL.

        The browser will automatically fill out the form. If `data` dict has
        been passed into ``kwargs``, the contained input values will override
        the automatically filled out values.

        Returns:
            `Response` object of a successful request.

        Raises:
            NoWebsiteLoadedError: If no website is currently loaded.
        """
        if self._url is None:
            raise NoWebsiteLoadedError('request submission requires a loaded website')
        data = kwargs.get('data', {})
        for i in self.soup('form').select('input[name]'):
            if i.get('name') not in data:
                data[i.get('name')] = i.get('value', '')
        kwargs['data'] = data
        response = self.session.post(self._url, **kwargs)
        self._url = response.url
        self._response = response
        return response
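
For readers who want to try the form-filling rule in `post` without a network connection, here is a minimal offline sketch. It is not the code above: it swaps bs4 for the standard library's `html.parser`, collects every named `<input>` rather than only those inside a `<form>`, and the names `InputCollector` and `fill_form` are hypothetical. It only illustrates the merge rule, where values found on the page are defaults and values passed in by the caller win.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect name/value pairs of <input name=...> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == 'input' and 'name' in attributes:
            # The first input with a given name wins; a missing value becomes ''.
            self.fields.setdefault(attributes['name'], attributes.get('value') or '')


def fill_form(html, overrides=None):
    """Return the page's input defaults, overridden by caller-supplied values."""
    collector = InputCollector()
    collector.feed(html)
    data = dict(collector.fields)
    data.update(overrides or {})
    return data


# Made-up page fragment, loosely modelled on an ASP.NET login form.
PAGE = (
    '<form>'
    '<input type="hidden" name="__VIEWSTATE" value="abc">'
    '<input type="text" name="user">'
    '<input type="submit" value="Login">'
    '</form>'
)

print(fill_form(PAGE, {'user': 'alice'}))
```

The unnamed submit button is skipped, the hidden field keeps its page value, and `user` takes the caller's value.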

leapy.py

import re
from browser import SimpleBrowser


class LeapError(Exception):
    pass


class LoginError(LeapError):
    pass


class Leap:
    """Interface class for automated access to the Leapcard website.

    Attributes:
        browser: An instance of `SimpleBrowser`
    """
    BASE_URL = 'https://www.leapcard.ie/en/'
    LOGIN_URL = BASE_URL + 'login.aspx'
    TABLE_URL = BASE_URL + 'SelfServices/CardServices/ViewJourneyHistory.aspx'

    def __init__(self):
        self.browser = SimpleBrowser()

    @property
    def login_cookie(self):
        """Return True if user authentication is successful."""
        return any('ASPXFORMSAUTH' in c.name for c in self.browser.cookies)

    def login(self, username, password):
        """Authenticate a user account to access user information.

        Args:
            username: Leapcard.ie account username
            password: Leapcard.ie account password

        Raises:
            LoginError: If user authentication fails.
        """
        self.browser.get(self.LOGIN_URL)
        data = {
            'ctl00$ContentPlaceHolder1$UserName': username,
            'ctl00$ContentPlaceHolder1$Password': password,
            'ctl00$ContentPlaceHolder1$btnlogin': 'Login'
        }
        self.browser.post(data=data)
        if self.login_cookie is False:
            raise LoginError('user login failure')

    def select_card(self, card_number):
        """Select the requested card number from the dropdown menu.

        In case of an account with multiple cards registered, this method
        will ensure that the correct card has been selected.

        Args:
            card_number: Unique Leap card number

        Raises:
            LeapError: If requested card is not registered in user account.
        """
        cards = self.browser.soup().select_one('select[id*=CardsList]')
        registered_cards = {c.text.split()[0]: c.get('value') for c in cards.select('option[value]')}
        if card_number not in registered_cards:
            raise LeapError('requested card not registered: {}'.format(card_number))
        data = {cards.get('name'): registered_cards.get(card_number)}
        self.browser.post(data=data)

    @property
    def balance(self):
        """Fetch dictionary with last known travel credit balance.

        Returns:
            A dictionary containing date and time of the last transaction
            made with a Leap card and the balance after the transaction.
        """
        self.browser.get(self.TABLE_URL)
        table = self.browser.soup().select_one('table[id*=CardJourney]')
        date = table.find_next(text=re.compile(r'\d{2}/\d{2}/\d{4}'))
        time = table.find_next(text=re.compile(r'\d{1,2}:\d{2} \wM'))
        balance = table.find_next(text=re.compile(r'€-?\d{1,3}\.\d{2}')).next_element.text.strip('€')
        return {'date': date, 'time': time, 'balance': balance}

Example table of the last 2 transactions:

<table class="table" cellspacing="0" cellpadding="3" rules="all" align="left" rules="none" id="gvCardJourney" style="border-width:1px;border-style:solid;width:100%;border-collapse:collapse;">
 <caption>
 Travel Credit History Information
 </caption><tr class="grid-header" align="left" style="color:White;background-color:#008033;">
 <th scope="col" abbr="Date">Date</th><th scope="col" abbr="Time">Time</th><th scope="col" abbr="ParticipantShortNameDescription">Source</th><th scope="col" abbr="TransactionTypeDescription">Transaction Type</th><th scope="col" abbr="TransactionAmountEuro">Amount</th><th scope="col" abbr="PurseAmountEuro">Balance</th>
 </tr><tr style="background-color:#EDEDED;">
 <td align="center">24/11/2017</td><td align="center" style="white-space:nowrap;">12:41 PM</td><td align="center">Luas</td><td align="center">Travel Credit Returned</td><td align="center">€2.13</td><td align="center">€6.49</td>
 </tr><tr style="background-color:#F2F1F1;">
 <td align="center">24/11/2017</td><td align="center" style="white-space:nowrap;">12:31 PM</td><td align="center">Luas</td><td align="center">Travel Credit Deduction</td><td align="center">€-2.13</td><td align="center">€4.36</td>
</tr>
</table>
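
To see what the three patterns in `balance` pick out, here is a minimal offline sketch that applies them to the cell texts of the most recent row in the sample table. It uses only `re` instead of bs4, assumes the amounts are written with a euro sign as the balance pattern expects, and the `summarise` helper is hypothetical rather than part of leapy.

```python
import re

# Cell texts of the most recent transaction row in the sample table.
ROW = ['24/11/2017', '12:41 PM', 'Luas', 'Travel Credit Returned', '€2.13', '€6.49']

# The patterns used in Leap.balance.
DATE = re.compile(r'\d{2}/\d{2}/\d{4}')
TIME = re.compile(r'\d{1,2}:\d{2} \wM')
MONEY = re.compile(r'€-?\d{1,3}\.\d{2}')


def summarise(cells):
    """Pick date, time and balance out of one row of cell texts."""
    date = next(cell for cell in cells if DATE.fullmatch(cell))
    time = next(cell for cell in cells if TIME.fullmatch(cell))
    # The first money cell is the Amount column, the one after it is the Balance.
    money_cells = [cell for cell in cells if MONEY.fullmatch(cell)]
    balance = money_cells[1].strip('€')
    return {'date': date, 'time': time, 'balance': balance}


print(summarise(ROW))
```

This mirrors the `.next_element` step in `balance`: the first money match is the Amount column, so the balance is the cell that follows it.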

Answer

The code is pretty well-documented and understandable, great job!

Just a few thoughts, nitpicks and ideas:

- move the exception definitions to a separate module/file, exceptions.py?
- and, instead of having Exception as a base class for your exceptions, consider introducing your own base exception class - like, for example, the RequestException in the requests library
- note that you may specify docstrings for exception classes instead of pass keywords - win-win, an opportunity to document an exception and follow the language rules
- instead of string concatenation, use urljoin() for URL joining
- think of more descriptive and explicit variable names than i or c
- I would probably explicitly specify A or P in the regular expression that matches the time:
```
\d{1,2}:\d{2} [AP]M
```
- I am not completely confident in the regular expression for the balance. It would not, for instance, match a €1000.00 balance value because of the \d{1,3} requirement. Also, the regular expression assumes that the decimal part will always be present - recheck whether this is always true for that page
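
A minimal sketch of the exception and URL suggestions, assuming the class names and URLs from the question. The docstring wording is made up for illustration, and in a real project the exception classes would live in their own exceptions.py module.

```python
from urllib.parse import urljoin


class BrowserError(Exception):
    """Base class for all errors raised by the browser (illustrative wording)."""


class ParsingError(BrowserError):
    """The current response cannot be parsed as HTML or XML."""


class NoWebsiteLoadedError(BrowserError):
    """An operation needed a loaded website, but none was loaded."""


# urljoin resolves the second argument relative to the first; the trailing
# slash on BASE_URL keeps the '/en/' path segment in the result.
BASE_URL = 'https://www.leapcard.ie/en/'
LOGIN_URL = urljoin(BASE_URL, 'login.aspx')
TABLE_URL = urljoin(BASE_URL, 'SelfServices/CardServices/ViewJourneyHistory.aspx')

print(LOGIN_URL)
print(TABLE_URL)
```

A docstring is a valid class body on its own, so the `pass` statements are no longer needed.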

Comment

Thanks Alexander! I'm still new to regex, but the maximum possible balance on a Leap card is €150.00, with the decimal part always present. This way, I figured I would match between 1 and 3 digits before the decimal point and exactly 2 after.
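
Both regular-expression points can be checked offline. The sketch below compares the patterns from the question with the stricter time pattern from the answer and with a relaxed balance pattern. `BALANCE_RELAXED` and the `matches` helper are hypothetical, shown only to illustrate the trade-off. Under the stated €150.00 cap, the original balance pattern is sufficient.

```python
import re

# Patterns as written in the question.
TIME_ORIGINAL = re.compile(r'\d{1,2}:\d{2} \wM')
BALANCE_ORIGINAL = re.compile(r'€-?\d{1,3}\.\d{2}')

# Time pattern suggested in the answer, plus a hypothetical relaxed balance
# pattern that allows any number of digits and an optional decimal part.
TIME_STRICT = re.compile(r'\d{1,2}:\d{2} [AP]M')
BALANCE_RELAXED = re.compile(r'€-?\d+(?:\.\d{2})?')


def matches(pattern, text):
    """Return True if the whole text matches the pattern."""
    return pattern.fullmatch(text) is not None


for sample in ('€150.00', '€-2.13', '€1000.00', '€150'):
    print(sample, matches(BALANCE_ORIGINAL, sample), matches(BALANCE_RELAXED, sample))

for sample in ('12:41 PM', '9:05 AM', '12:41 XM'):
    print(sample, matches(TIME_ORIGINAL, sample), matches(TIME_STRICT, sample))
```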

alecxe · Accepted Answer · 2017-11-27 17:34:06Z

