Below are two functions that work without problems in my current script. They are written to run on Python 2.7.x.
def tor_browser_initialise():
        """ This function checks whether the Tor Browser is running. If it isn't,
        it will open the Tor Browser.
        """
        processlist = []
        for p in psutil.process_iter():
            try:
                process = psutil.Process(p.pid)
                pname = process.name()
                processlist.append(pname)
            except:
                continue
        if "tor.exe" not in processlist:
            process = subprocess.Popen(r"C:\Program Files (x86)\Tor Browser\Browser\firefox.exe", stdout=subprocess.PIPE)
            time.sleep(30)

def connect_tor(url):
    """ This function accepts a URL as an argument. It accesses the URL via TOR before
    returning the HTML source code to the function that called it. This function also
    uses random browser information.
    """
    LOCALHOST = "127.0.0.1"
    PORT = 9150
    useragent_list = ['Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0',
        'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/29.0',
        'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36',
        'Mozilla/5.0 (compatible; MSIE 10.6; Windows NT 6.1; Trident/5.0; InfoPath.2; SLCC1; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET CLR 2.0.50727) 3gpp-gba UNTRUSTED/1.0',
        'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
        'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14',
        'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30']
    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, LOCALHOST, PORT)
    socket.socket = socks.socksocket
    request = urllib2.Request(url)
    request.add_header('User-Agent', random.choice(useragent_list))
    response = urllib2.urlopen(request)
    return response

I'd like to know if there is a more concise and Pythonic way of writing the two functions. I haven't listed the dependent libraries / modules at the beginning of the code, but it does work correctly.
The Tor website has a Python library with no dependencies: Stem. (comment by moonman239, Jan 3, 2016 at 2:55)
3 Answers
Avoid bare except
Writing `except:` without specifying a precise exception is asking for trouble: anything will be caught, silencing all possible bugs. Instead, use `except MyExpectedKindOfException`.
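To make the difference concrete, here is a minimal sketch. The `NoSuchProcess` class and the `FakeProcess` objects are hypothetical stand-ins so the example runs without any third-party library; in the real script you would catch whichever specific exceptions your process library documents for `name()`.

```python
class NoSuchProcess(Exception):
    """Hypothetical stand-in for a 'process has vanished' error."""


class FakeProcess(object):
    """Hypothetical process object; not part of any real library."""

    def __init__(self, name, gone=False):
        self._name = name
        self._gone = gone

    def name(self):
        if self._gone:
            raise NoSuchProcess(self._name)
        return self._name


def collect_names(processes):
    """Return the names of all processes that are still alive."""
    names = []
    for proc in processes:
        try:
            names.append(proc.name())
        except NoSuchProcess:
            # Expected: the process exited while we were iterating.
            continue
    return names


names = collect_names([FakeProcess("tor.exe"), FakeProcess("old.exe", gone=True)])
```

With the narrow `except`, a genuine bug (for example passing an object that has no `name()` method) raises an `AttributeError` instead of being silently skipped.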
Reconsider the very long sleeping
The function `tor_browser_initialise` ends with `time.sleep(30)`.
That is a long time to sleep. Are you 100% sure that every call to that function will want to sleep that long?
Much worse, the sleep is not documented, so callers will see their program hang for 30 seconds for no apparent reason!
Just remove the call to `time.sleep` and let the caller decide whether, and for how long, to sleep after calling the function.
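If the caller does need to wait, one option (not part of the original code, just a sketch) is to poll for a condition with an explicit timeout rather than sleeping for a fixed period. The helper names `port_is_open` and `wait_until` are my own, and checking the SOCKS port from the question (9150) is an assumption about what "ready" means here.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        return False
    conn.close()
    return True


def wait_until(check, timeout, interval=0.5):
    """Call check() until it returns True or timeout seconds have passed."""
    deadline = time.time() + timeout
    while True:
        if check():
            return True
        if time.time() >= deadline:
            return False
        time.sleep(interval)
```

A caller could then write something like `wait_until(lambda: port_is_open("127.0.0.1", 9150), timeout=30)`, which returns as soon as the port answers and makes the maximum wait explicit at the call site.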
A word on indentation
You have mismatched indentation levels in `tor_browser_initialise` (shouldn't it be "initialize"?): 8 spaces at the beginning and then 4. Choose one and stick to it. PEP 8 recommends 4 spaces.
It also has some recommendations on aligning continuation lines. You'd be better off using:
useragent_list = [
    'Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 '
    'Firefox/31.0',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 '
    'Firefox/29.0',
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36',
    'Mozilla/5.0 (compatible; MSIE 10.6; Windows NT 6.1; '
    'Trident/5.0; InfoPath.2; SLCC1; .NET CLR 3.0.4506.2152; '
    '.NET CLR 3.5.30729; .NET CLR 2.0.50727) 3gpp-gba UNTRUSTED/1.0',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
    'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14',
    'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) '
    'AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30'
]
This also uses implicit string literal concatenation to keep line lengths under 80 characters.
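For anyone unfamiliar with implicit concatenation, here is a small illustration with shortened, made-up list entries. It also shows the one pitfall worth knowing: inside a list, a forgotten comma silently merges two entries into one.

```python
# Adjacent string literals are joined at compile time into one string.
agent = ('Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 '
         'Firefox/31.0')

# Pitfall: without the comma, the two literals become a single entry.
with_comma = ['Opera/9.80', 'Mozilla/5.0']
without_comma = ['Opera/9.80' 'Mozilla/5.0']
```

Here `with_comma` has two entries, while `without_comma` has only one, so it pays to double-check the commas when splitting long strings this way.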
Use constants as such
`LOCALHOST`, `PORT`, and `useragent_list` are constants; you even use uppercase for two of them to emphasize it. Why redefine them each time you call `connect_tor`, then?
You should move them from the function body to the top level of the file. You may also be interested in turning `useragent_list` (or `USER_AGENTS`, which I find better) into an immutable collection such as a `tuple` or a `frozenset`. Note that `random.choice` needs an indexable sequence, so a `tuple` works directly, whereas a `frozenset` would have to be converted first.
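A sketch of what that could look like. The list is shortened to two of the entries from the question, and `random_user_agent` is a hypothetical helper name of my own.

```python
import random

LOCALHOST = "127.0.0.1"
PORT = 9150
# Shortened here; the full list from the question goes in the same place.
USER_AGENTS = (
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
    'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14',
)


def random_user_agent():
    """Pick one user agent; random.choice needs an indexable sequence."""
    return random.choice(USER_AGENTS)
```

Because the constants are built once at import time, `connect_tor` no longer rebuilds the list on every call, and the tuple cannot be modified by accident.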
Save on resources and computation
You could improve `tor_browser_initialise` by returning early if you find the `'tor.exe'` process. You could thus get rid of `processlist`, since exiting the `for` loop would mean that you didn't `return` early and thus didn't find the process you were looking for.
def tor_browser_initialise():
    for p in psutil.process_iter():
        try:
            process = psutil.Process(p.pid)
            if process.name() == 'tor.exe':
                return
        except:
            continue
    subprocess.Popen(
        r"C:\Program Files (x86)\Tor Browser\Browser\firefox.exe",
        stdout=subprocess.PIPE)
    time.sleep(30)

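The same early-exit idea can also be expressed with `any()`, which stops at the first match. This is only a sketch: it assumes the process names are available as an iterable of strings, and `is_running` and `logged` are hypothetical helpers written for the demonstration.

```python
def is_running(target, names):
    """Return True as soon as target is seen; later names are not examined."""
    return any(name == target for name in names)


def logged(names, seen):
    """Yield names one by one, recording each in seen (for demonstration)."""
    for name in names:
        seen.append(name)
        yield name


seen = []
found = is_running("tor.exe", logged(["explorer.exe", "tor.exe", "cmd.exe"], seen))
```

After this runs, `seen` contains only the first two names: iteration stopped as soon as `tor.exe` was found, which is exactly the saving the early `return` gives you.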
Right now you've hardcoded the location of the Tor Browser, which is not ideal. It also assumes a Windows path. This could be improved by passing the path as a parameter:
def tor_browser_initialise(tor_path):
    # stuff
    process = subprocess.Popen(tor_path, stdout=subprocess.PIPE)
    # other stuff

If you want to provide default paths, you could do so as follows. This also lets you provide a different default for each operating system, keyed on sys.platform:
import sys


def tor_browser_initialise(tor_path=None):
    if tor_path is None:
        tor_path = get_default_path()
    # the rest of it


DEFAULT_TOR_PATHS = {
    'win32': r"C:\Program Files (x86)\Tor Browser\Browser\firefox.exe"
}


def get_default_path():
    try:
        return DEFAULT_TOR_PATHS[sys.platform]
    except KeyError:
        # Each list element is one fragment of the message, joined by spaces.
        raise ValueError(' '.join([
            "There is no default path for Tor on your system,",
            "detected to be {}.".format(sys.platform),
            "You must provide a path"]))

You could also use `os.path.join` if you don't want to worry about raw strings or escaping backslashes.
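A small sketch of that approach. I use `ntpath` directly, which is the module that `os.path` refers to on Windows, so the example gives the same result on any operating system; in the real script you would simply call `os.path.join`. Note that the drive root still needs one escaped backslash.

```python
import ntpath  # os.path is this module when running on Windows

# Only the drive root needs a backslash; join() inserts the other separators.
tor_path = ntpath.join(
    "C:\\", "Program Files (x86)", "Tor Browser", "Browser", "firefox.exe")
```

The resulting `tor_path` is the same string as the raw-string literal used in the question.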