\$\begingroup\$

I code a lot of web crawlers and web scrapers and I find myself writing the same functions over and over again. I also find myself having to come to Stack Overflow and find the answer to a question I have had to ask myself a dozen times. Things like "how to supply proxies to requests" or "how to use custom headers with requests" or "how to set the User-Agent in requests" and things of that sort. So I'm writing this module to abstract some of these mundane routines.

My concerns

  • Is the code pythonic?
  • Would this be of use to anybody other than me?
  • Are there any bugs?
  • Is it ok to have that many methods in a class?
  • How are my naming conventions?

#!/usr/bin/env python
'''
this module was designed with web scrapers and web crawlers in mind.
I find my self writing these functions all the time. I Wrote this model
to save time.
'''
import requests
import urlparse
import urllib2
import urllib
import re
import os
import json
from fake_useragent import UserAgent
class InvalidURL(Exception):
    pass
class URL(object):
    '''Commomn routines for dealing with URLS.
    '''
    def __init__(self, url):
        '''Setup the initial state
        '''
        self.raw_url = url
        self.url = urlparse.urlparse(url)
        self.scheme = self.url.scheme
        self.domain = self.url.netloc
        self.path = self.url.path
        self.params = self.url.params
        self.query = self.url.query
        self.fragment = self.url.fragment
    def __str__(self):
        ''' This os called when somthing
        asks for a string representation of the
        url
        '''
        return self.raw_url
    def valid(self):
        """Validate the url.
        returns True if url is valid
        and False if it is not
        """
        regex = re.compile(
            r'^(?:http|ftp)s?://' # http:// or https://
            r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'
            r'localhost|' #localhost...
            r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
            r'(?::\d+)?' # optional port
            r'(?:/?|[/?]\S+)$', re.IGNORECASE)
        match = regex.match(self.raw_url)
        if match:
            return True
    def unquote(self):
        """unquote('abc%20def') -> 'abc def'."""
        return urllib2.unquote(self.raw_url)
    def quote(self):
        """quote('abc def') -> 'abc%20def'
        Each part of a URL, e.g. the path info, the query, etc., has a
        different set of reserved characters that must be quoted.
        RFC 2396 Uniform Resource Identifiers (URI): Generic Syntax lists
        the following reserved characters.
        reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                   "$" | ","
        Each of these characters is reserved in some component of a URL,
        but not necessarily in all of them.
        By default, the quote function is intended for quoting the path
        section of a URL. Thus, it will not encode '/'. This character
        is reserved, but in typical usage the quote function is being
        called on a path where the existing slash characters are used as
        reserved characters.
        """
        return urllib2.quote(self.raw_url)
    def parameters(self):
        """
        parse the parameters of the url
        and return them as a dict.
        """
        return urlparse.parse_qs(self.params)
    def secure(self):
        """ Checks if the url uses ssl. """
        if self.scheme == 'https':
            return True
    def extention(self):
        """ return the file extention """
        return os.path.splitext(self.path)[1]
    def absolute(self):
        """ Checks if the URL is absolute. """
        return bool(self.domain)
    def relitive(self):
        """ Checks if the url is relitive. """
        return bool(self.scheme) is False
    def encode(self, mapping):
        """Encode a sequence of two-element tuples or dictionary into a URL query string.
        If any values in the query arg are sequences and doseq is true, each
        sequence element is converted to a separate parameter.
        If the query arg is a sequence of two-element tuples, the order of the
        parameters in the output will match the order of parameters in the
        input.
        """
        query = urllib.urlencode(mapping)
        return urlparse.urljoin(self.raw_url, query)
class Request(object):
    allow_redirects = True
    timeout = 5
    ramdom_useragent = 0
    verify_ssl = False
    session = requests.Session()
    stream = True
    proxies = {}
    def __init__(self, url):
        """ Set the inital state """
        self.agentHeaders = {}
        self.url = URL(url)
        if not self.url.valid():
            raise InvalidURL("{} is invalid".format(url))
    def stream(self, answer):
        self.stream = bool(answer)
    def randomUserAgent(self):
        """ Set a random User-Agent """
        self.setUserAgent(UserAgent().random)
    def allowRedirects(self, answer):
        """ Choose whether or not to follow redirects."""
        self.allow_redirects = bool(answer)
    def setUserAgent(self, agent):
        """ Set the User-Agent """
        self.setHeaders('User-Agent', agent)
    def setHeaders(self, key, value):
        """ Set custom headers """
        self.agentHeaders[key] = value
    def verify(self, answer):
        """ Set whether or not to verify SSL certs"""
        self.verify_ssl = bool(answer)
    def get(self):
        """Sends a GET request"""
        return self.session.get(
            url=self.url,
            headers=self.agentHeaders,
            allow_redirects=self.allow_redirects,
            timeout=self.timeout,
            verify=self.verify_ssl,
            stream=self.stream,
            proxies=self.proxies
        )
    def head(self):
        """ Send a head request and return the headers """
        return self.session.head(
            self.url,
            headers=self.agentHeaders,
            allow_redirects=self.allow_redirects,
            timeout=self.timeout,
            verify=self.verify_ssl,
            proxies=self.proxies
        ).headers
    def options(self):
        """ Send a options request and return the options """
        return self.session.options(
            self.url,
            headers=self.agentHeaders,
            allow_redirects=self.allow_redirects,
            timeout=self.timeout,
            verify=self.verify_ssl,
            proxies=self.proxies
        ).headers['allow']
    def json(self):
        """
        Deserialize json data (a ``str`` or ``unicode`` instance
        containing a JSON document) to a Python object.
        """
        return json.loads(self.text)
    def headerValue(self, value):
        """ Get a value from the headers. """
        return self.headers().get(value)
request = Request('https://www.google.com')
req = request.get()
Jamal
asked Jul 29, 2017 at 20:40
\$\endgroup\$
  • \$\begingroup\$ Please look for existing relevant site tags before trying to create your own. We don't need a tag for every specific thing. \$\endgroup\$ Commented Jul 29, 2017 at 20:48

3 Answers

\$\begingroup\$

Your code feels like an unnecessary duplication of existing things.

(I’m going to skip things alecxe already mentioned)

  • Most methods on the URL class are one-liners that refer to the raw URL and pass it onto some urlparse/urllib2 function. If you need only one of those functions, it would be better to do urllib2.unquote(some_url) than URL(some_url).unquote() — in addition to readability, your method creates an object that is very quickly discarded (and calls urlparse, the results of which are unused).
  • secure is misleading — https is not the only TLS-using protocol out there
  • Typo: relitive should be spelled relative
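
To illustrate the first point, here is a minimal sketch of calling the standard library directly. It assumes Python 3, where the `urlparse` and `urllib2` functions used in the question live in `urllib.parse`; the URL is made up for the example.

```python
from urllib.parse import parse_qs, quote, unquote, urlsplit

# Hypothetical URL, used only for illustration.
url = "https://example.com/some%20path/report.pdf?tag=a&tag=b"

# One-off operations need no wrapper object.
decoded = unquote(url)
encoded_path = quote("/some path/report.pdf")  # '/' is left alone by default

# Parse once and read the named fields when several parts are needed.
parts = urlsplit(url)
query = parse_qs(parts.query)

print(decoded)
print(encoded_path)
print(parts.scheme, parts.netloc, parts.path)
print(query)
```

Note that the query string is `parts.query`. The question's `parameters()` parses `self.params`, which is the rarely used `;parameters` segment of the path, so it would probably return an empty dict for most URLs.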

  • The Request class is again overcomplicating and duplicating code. It exposes only a few features of the library, making it very inflexible. It uses a single session for every request, which means leaking state between requests.
  • You still need to type more:

    request = Request('https://www.google.com')
    req = request.get()
    # -- versus --
    req = requests.get('https://www.google.com')
    # -- and if you need sessions, it’s still shorter --
    s = requests.Session()
    req = s.get('https://www.google.com')
    
  • Users still need to interact with requests’ Response objects. In fact, after issuing the requests, users will discard your Request object, simply because it’s unnecessary.
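
The state-leaking point can be shown without any network access. The sketch below uses two hypothetical classes (not part of the question's module) to contrast a class-level dict, like `proxies = {}` in the question, with one created in `__init__`. The same mechanism applies to the class-level `session`: a `requests.Session` keeps cookies between requests, so every `Request` object would share them.

```python
class SharedState(object):
    # Class attribute: one dict shared by every instance.
    proxies = {}


class PerInstanceState(object):
    def __init__(self):
        # Instance attribute: each object gets its own dict.
        self.proxies = {}


a, b = SharedState(), SharedState()
a.proxies["http"] = "http://proxy.invalid:8080"  # mutates the shared dict
print(b.proxies)  # b sees the proxy that was set through a

c, d = PerInstanceState(), PerInstanceState()
c.proxies["http"] = "http://proxy.invalid:8080"
print(d.proxies)  # d is unaffected
```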

  • The json and headerValue methods are broken. (json should use Response.json(), btw)
  • Setter methods (allowRedirects, verify, setUserAgent, setHeaders) are unnecessary and considered very bad style in Python. Additionally, the names of allowRedirects and verify are easy to confuse for allow_redirects and verify_ssl (the underlying properties)
  • It does not make sense to call .get then .post (or .get twice) on the same object; this is why requests.get(url) and requests.Request('GET', url) explicitly specify the method.
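
As a rough sketch of the setter point: settings can be ordinary attributes that are passed to the constructor or assigned directly. `RequestOptions` and `as_kwargs` are hypothetical names invented for this example (Python 3.7+ for `dataclasses`); the keys mirror keyword arguments that `requests.get` accepts.

```python
from dataclasses import dataclass, field


@dataclass
class RequestOptions:
    """Hypothetical container for per-request settings."""

    headers: dict = field(default_factory=dict)
    allow_redirects: bool = True
    timeout: float = 5
    verify: bool = True

    def as_kwargs(self):
        """Return the settings as a dict of keyword arguments."""
        return {
            "headers": dict(self.headers),
            "allow_redirects": self.allow_redirects,
            "timeout": self.timeout,
            "verify": self.verify,
        }


options = RequestOptions(headers={"User-Agent": "my-crawler/0.1"})
options.timeout = 10  # plain attribute assignment, no setter method needed
kwargs = options.as_kwargs()
print(kwargs)
# With requests installed, the dict could be passed along as
# requests.get(url, **kwargs).
```
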
answered Jul 30, 2017 at 13:08
\$\endgroup\$
\$\begingroup\$

Code Style notes

  • organize imports in separate groups, have a single line break between the groups, have two newlines after the imports and before the code starts (PEP8 reference)
  • have two blank lines between the class definitions, single blank line between class methods, remove extra newlines (PEP8 reference)
  • have your docstrings properly formatted - they should be in triple double quotes, start with a capital letter and end with a dot (PEP8 reference)
  • naming - use lower_case_with_underscores variable and method naming style (PEP8 reference)
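
A small sketch of what these notes might look like when applied together. The module, the class name `Url` and the method `file_extension` are hypothetical and only demonstrate layout; the code assumes Python 3.

```python
"""Helpers for inspecting URLs (hypothetical module used to show layout)."""

# Standard library imports come first, one per line.
import os
from urllib.parse import urlsplit

# Third-party imports (for example `import requests`) would form a second
# group here, separated from the group above by one blank line.


class Url(object):
    """Wrap a URL string and expose its parsed parts."""

    def __init__(self, raw_url):
        """Parse the URL once and keep the result."""
        self.raw_url = raw_url
        self.parts = urlsplit(raw_url)

    def file_extension(self):
        """Return the extension of the path, including the leading dot."""
        return os.path.splitext(self.parts.path)[1]


print(Url("https://example.com/files/report.pdf").file_extension())
```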

Other notes and thoughts

  • Python 3 compatibility - as of now, the code is Python-2.x only - if you want the code to be re-used by others, think about making it both Python 2 and 3 compatible
  • beware of God objects
  • you can use "verbose" mode for your regular expression which might make it even more readable - even though you've done a good job documenting it
  • I am not 100% sure about having a session instance as a class variable - I think it should better be an instance variable (differences)
  • I think Request class requires some explanation - consider adding a docstring
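
For the verbose-mode suggestion, here is a sketch of the question's pattern rewritten with `re.VERBOSE`, compiled once at module level. The helper name `is_valid_url` is made up for this example; it also returns an explicit `False` for non-matches, whereas the question's `valid()` falls through and returns `None`.

```python
import re

URL_PATTERN = re.compile(
    r"""
    ^(?:http|ftp)s?://                               # http, https, ftp or ftps
    (?:
        (?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+  # dotted domain labels ...
        (?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)           # ... then a TLD-like suffix
      | localhost                                    # or localhost
      | \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}           # or a dotted-quad IP address
    )
    (?::\d+)?                                        # optional port
    (?:/?|[/?]\S+)$                                  # optional path or query
    """,
    re.IGNORECASE | re.VERBOSE,
)


def is_valid_url(url):
    """Return True if url matches URL_PATTERN, otherwise False."""
    return bool(URL_PATTERN.match(url))


print(is_valid_url("https://www.google.com"))
print(is_valid_url("not a url"))
```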
answered Jul 30, 2017 at 2:51
\$\endgroup\$
\$\begingroup\$

You asked

Would this be of use to anybody other than me?

Likely you already have automated tests that exercise all the lines of code; it would be useful to post the tests along with the module. This would help answer questions such as, "have we ever seen callers with a need to manipulate Request headers after construction?", leading perhaps to the "setter" code moving into __init__(). Consider using an underscore prefix for methods you don't intend to be public.

This identifier has the wrong name:

 self.agentHeaders = {}

Rather than agent_headers, more accurately it would simply be headers, since currently the public API offers support for adding arbitrary headers.

Typos: extention should be extension, and ramdom_useragent = 0 should be random_useragent; the latter is also unused.

Double un-quoting errors are common enough in web code (e.g. https://bugs.python.org/issue2244 ). Your module has an opportunity to immediately offer the caller an exception at that point, making such bugs shallow.
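
One possible reading of that suggestion, sketched under assumptions: the object remembers whether its text is still percent-encoded and refuses a second decode. `EncodedText` is a hypothetical class, not part of the question's module, and the code assumes Python 3.

```python
from urllib.parse import unquote


class EncodedText:
    """Hypothetical wrapper that remembers whether it is still percent-encoded."""

    def __init__(self, value, encoded=True):
        self.value = value
        self.encoded = encoded

    def unquote(self):
        """Decode once; refuse to decode text that was already decoded."""
        if not self.encoded:
            raise ValueError("value has already been unquoted")
        return EncodedText(unquote(self.value), encoded=False)


raw = EncodedText("100%2525%20off")  # contains a literal "%25" encoded once
once = raw.unquote()
print(once.value)
try:
    once.unquote()  # a second decode would silently change the data
except ValueError as error:
    print("refused:", error)
```

Without the guard, the second decode would turn the literal "%25" into "%", which is the kind of silent corruption that makes these bugs hard to spot.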

answered Jul 30, 2017 at 17:08
\$\endgroup\$
