I code a lot of web crawlers and web scrapers, and I find myself writing the same functions over and over again. I also find myself coming to Stack Overflow to look up answers to questions I have asked myself a dozen times: things like "how to supply proxies to requests", "how to use custom headers with requests", or "how to set the User-Agent in requests". So I'm writing this module to abstract some of these mundane routines.
My concerns
- Is the code pythonic?
- Would this be of use to anybody other than me?
- Are there any bugs?
- Is it ok to have that many methods in a class?
- How are my naming conventions?
#!/usr/bin/env python
'''
this module was designed with web scrapers and web crawlers in mind.
I find my self writing these functions all the time. I Wrote this model
to save time.
'''
import requests
import urlparse
import urllib2
import urllib
import re
import os
import json
from fake_useragent import UserAgent
class InvalidURL(Exception):
pass
class URL(object):
'''Commomn routines for dealing with URLS.
'''
def __init__(self, url):
'''Setup the initial state
'''
self.raw_url = url
self.url = urlparse.urlparse(url)
self.scheme = self.url.scheme
self.domain = self.url.netloc
self.path = self.url.path
self.params = self.url.params
self.query = self.url.query
self.fragment = self.url.fragment
def __str__(self):
''' This os called when somthing
asks for a string representation of the
url
'''
return self.raw_url
def valid(self):
"""Validate the url.
returns True if url is valid
and False if it is not
"""
regex = re.compile(
r'^(?:http|ftp)s?://' # http:// or https://
r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'
r'localhost|' #localhost...
r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
r'(?::\d+)?' # optional port
r'(?:/?|[/?]\S+)$', re.IGNORECASE)
match = regex.match(self.raw_url)
if match:
return True
def unquote(self):
"""unquote('abc%20def') -> 'abc def'."""
return urllib2.unquote(self.raw_url)
def quote(self):
"""quote('abc def') -> 'abc%20def'
Each part of a URL, e.g. the path info, the query, etc., has a
different set of reserved characters that must be quoted.
RFC 2396 Uniform Resource Identifiers (URI): Generic Syntax lists
the following reserved characters.
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
"$" | ","
Each of these characters is reserved in some component of a URL,
but not necessarily in all of them.
By default, the quote function is intended for quoting the path
section of a URL. Thus, it will not encode '/'. This character
is reserved, but in typical usage the quote function is being
called on a path where the existing slash characters are used as
reserved characters.
"""
return urllib2.quote(self.raw_url)
def parameters(self):
"""
parse the parameters of the url
and return them as a dict.
"""
return urlparse.parse_qs(self.params)
def secure(self):
""" Checks if the url uses ssl. """
if self.scheme == 'https':
return True
def extention(self):
""" return the file extention """
return os.path.splitext(self.path)[1]
def absolute(self):
""" Checks if the URL is absolute. """
return bool(self.domain)
def relitive(self):
""" Checks if the url is relitive. """
return bool(self.scheme) is False
def encode(self, mapping):
"""Encode a sequence of two-element tuples or dictionary into a URL query string.
If any values in the query arg are sequences and doseq is true, each
sequence element is converted to a separate parameter.
If the query arg is a sequence of two-element tuples, the order of the
parameters in the output will match the order of parameters in the
input.
"""
query = urllib.urlencode(mapping)
return urlparse.urljoin(self.raw_url, query)
class Request(object):
allow_redirects = True
timeout = 5
ramdom_useragent = 0
verify_ssl = False
session = requests.Session()
stream = True
proxies = {}
def __init__(self, url):
""" Set the inital state """
self.agentHeaders = {}
self.url = URL(url)
if not self.url.valid():
raise InvalidURL("{} is invalid".format(url))
def stream(self, answer):
self.stream = bool(answer)
def randomUserAgent(self):
""" Set a random User-Agent """
self.setUserAgent(UserAgent().random)
def allowRedirects(self, answer):
""" Choose whether or not to follow redirects."""
self.allow_redirects = bool(answer)
def setUserAgent(self, agent):
""" Set the User-Agent """
self.setHeaders('User-Agent', agent)
def setHeaders(self, key, value):
""" Set custom headers """
self.agentHeaders[key] = value
def verify(self, answer):
""" Set whether or not to verify SSL certs"""
self.verify_ssl = bool(answer)
def get(self):
"""Sends a GET request"""
return self.session.get(
url=self.url,
headers=self.agentHeaders,
allow_redirects=self.allow_redirects,
timeout=self.timeout,
verify=self.verify_ssl,
stream=self.stream,
proxies=self.proxies
)
def head(self):
""" Send a head request and return the headers """
return self.session.head(
self.url,
headers=self.agentHeaders,
allow_redirects=self.allow_redirects,
timeout=self.timeout,
verify=self.verify_ssl,
proxies=self.proxies
).headers
def options(self):
""" Send a options request and return the options """
return self.session.options(
self.url,
headers=self.agentHeaders,
allow_redirects=self.allow_redirects,
timeout=self.timeout,
verify=self.verify_ssl,
proxies=self.proxies
).headers['allow']
def json(self):
"""
Deserialize json data (a ``str`` or ``unicode`` instance
containing a JSON document) to a Python object.
"""
return json.loads(self.text)
def headerValue(self, value):
""" Get a value from the headers. """
return self.headers().get(value)
request = Request('https://www.google.com')
req = request.get()
Comment (Jamal, Jul 29, 2017): Please look for existing relevant site tags before trying to create your own. We don't need a tag for every specific thing.
3 Answers
Your code feels like an unnecessary duplication of existing things.
(I’m going to skip things alecxe already mentioned)
- Most methods on the `URL` class are one-liners that refer to the raw URL and pass it on to some urlparse/urllib2 function. If you need only one of those functions, it would be better to do `urllib2.unquote(some_url)` than `URL(some_url).unquote()` — in addition to readability, your method creates an object that is very quickly discarded (and calls `urlparse`, the results of which are unused).
- `secure` is misleading — `https` is not the only TLS-using protocol out there.
- Typo: `relative`.
- The `Request` class is again overcomplicating and duplicating code. It exposes only a few features of the library, making it very inflexible. It uses a single session for every request, which means leaking state between requests. You still need to type more:

    request = Request('https://www.google.com')
    req = request.get()
    # -- versus --
    req = requests.get('https://www.google.com')
    # -- and if you need sessions, it’s still shorter --
    s = requests.Session()
    req = s.get('https://www.google.com')
Users still need to interact with requests’ `Response` objects. In fact, after issuing the requests, users will discard your `Request` object, simply because it’s unnecessary.
- The `json` and `headerValue` methods are broken (`json` should use `Response.json()`, btw); see the sketch after this list.
- Setter methods (`allowRedirects`, `verify`, `setUserAgent`, `setHeaders`) are unnecessary and considered very bad style in Python. Additionally, the names of `allowRedirects` and `verify` are easy to confuse with `allow_redirects` and `verify_ssl` (the underlying properties).
- It does not make sense to call `.get` then `.post` (or `.get` twice) on the same thing; this is why `requests.get(url)` and `requests.Request('GET', url)` explicitly specify the method.
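To make the `Response.json()` and setter points concrete, here is a minimal sketch of how small the wrapper can become once configuration is passed as arguments and response handling is left to requests. The `fetch_json` name and its keyword arguments are illustrative only, not part of the reviewed module:

    import requests

    def fetch_json(url, user_agent=None, proxies=None, timeout=5):
        """Fetch a URL and return its decoded JSON body.

        A hypothetical replacement for Request.get() + Request.json():
        configuration is passed as keyword arguments instead of setter
        methods, and JSON decoding is delegated to Response.json().
        """
        headers = {'User-Agent': user_agent} if user_agent else {}
        response = requests.get(url, headers=headers, proxies=proxies,
                                timeout=timeout)
        response.raise_for_status()   # surface HTTP errors instead of hiding them
        return response.json()        # requests already knows how to decode JSON

Anything the helper does not cover (sessions, streaming, redirects) can still be passed straight to requests when it is actually needed.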
Code Style notes
- organize imports in separate groups, have a single line break between the groups, have two newlines after the imports and before the code starts (PEP8 reference)
- have two blank lines between the class definitions, single blank line between class methods, remove extra newlines (PEP8 reference)
- have your docstrings properly formatted - they should be in triple double quotes, start with a capital letter and end with a dot (PEP8 reference)
- naming - use `lower_case_with_underscores` variable and method naming style (PEP8 reference)
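As a rough illustration of those rules applied to the reviewed module, the top of the file might be laid out like this; only the import grouping, blank lines, and docstring form are the point, nothing about behaviour changes:

    #!/usr/bin/env python
    """Helpers for web scrapers and web crawlers."""

    # standard library imports, one group, alphabetical
    import json
    import os
    import re
    import urllib
    import urllib2
    import urlparse

    # third-party imports in their own group
    import requests
    from fake_useragent import UserAgent


    class InvalidURL(Exception):
        """Raised when a URL fails validation."""


    class URL(object):
        """Common routines for dealing with URLs."""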
Other notes and thoughts
- Python 3 compatibility - as of now, the code is Python-2.x only - if you want the code to be re-used by others, think about making it both Python 2 and 3 compatible
- beware of God objects
- you can use "verbose" mode for your regular expression which might make it even more readable - even though you've done a good job documenting it (see the sketch after this list)
- I am not 100% sure about having a `session` instance as a class variable - I think it should better be an instance variable (differences)
- I think the `Request` class requires some explanation - consider adding a docstring
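For the "verbose" regex point, the validation pattern from `valid()` can be reflowed with `re.VERBOSE` so the comments live inside the pattern itself; this is the same expression from the question, just re-laid-out (the `URL_RE` name is mine):

    import re

    URL_RE = re.compile(r"""
        ^(?:http|ftp)s?://                                 # scheme: http(s) or ftp(s)
        (?:
            (?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+    # domain labels...
            (?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)             # ...and TLD
            |localhost                                     # or localhost
            |\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}            # or an IPv4 address
        )
        (?::\d+)?                                          # optional port
        (?:/?|[/?]\S+)$                                    # path / query
    """, re.IGNORECASE | re.VERBOSE)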
You asked
Would this be of use to anybody other than me?
Likely you already have automated tests that exercise all the lines of code -- it would be useful to post the tests along with the module. This would help answer questions such as, "have we ever seen callers with a need to manipulate Request headers after construction?", leading perhaps to the "setter" code moving into `__init__()`. Consider using an underscore prefix for methods you don't intend to be public.
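If tests don't exist yet, even a couple of small ones are worth posting. A minimal sketch using the standard unittest module; the `scrapeutils` module name and the specific cases are assumptions, since the post doesn't name the module:

    import unittest

    from scrapeutils import URL, InvalidURL, Request   # module name assumed


    class URLTests(unittest.TestCase):

        def test_valid_accepts_https_url(self):
            self.assertTrue(URL('https://www.google.com').valid())

        def test_valid_rejects_garbage(self):
            self.assertFalse(URL('not a url').valid())


    class RequestTests(unittest.TestCase):

        def test_invalid_url_raises(self):
            with self.assertRaises(InvalidURL):
                Request('not a url')


    if __name__ == '__main__':
        unittest.main()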
This identifier has the wrong name:

    self.agentHeaders = {}

Rather than `agent_headers`, more accurately it would simply be `headers`, since currently the public API offers support for adding arbitrary headers.
Typo: `extention`. This is a typo, plus it's unused: `ramdom_useragent = 0`.
Double un-quoting errors are common enough in web code (e.g. https://bugs.python.org/issue2244 ). Your module has an opportunity to immediately offer the caller an exception at that point, making such bugs shallow.
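One way to act on that, sketched as a guess at what such a check could look like rather than a drop-in fix: a heuristic guard in the unquoting path (shown here as a standalone `safe_unquote` helper; false positives are possible for URLs that legitimately contain encoded percent signs):

    import urllib2


    class DoubleQuotingError(Exception):
        """Raised when a URL appears to have been percent-encoded twice."""


    def safe_unquote(url):
        """Unquote a URL, refusing input that still decodes further.

        Heuristic: if unquoting changes the string and the result would
        change again on a second pass ('%2520' -> '%20' -> ' '), the value
        was probably encoded twice upstream, so raise instead of returning
        a silently wrong result.
        """
        once = urllib2.unquote(url)
        if once != url and urllib2.unquote(once) != once:
            raise DoubleQuotingError(url)
        return once

`URL.unquote()` could delegate to something like this instead of calling `urllib2.unquote` directly.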