I have written a program in Python that checks websites for updates (it checks in anything on the website changes) and notifies the user. The idea is downloading the webpage, stripping HTML tags off it so extract the actual text and then calculating its md5 hash, so that we know when something changes. This is what my code looks right now (it's working on SOME websites). I would appreciate feedback about my style / logic / possible features that could be interesting etc. I also would like to know how to find out on which websites my code works and on which websites it doesn't work and why (for example if there are dynamic numbers that change it will mess up the MD5 hash).
import vlc # needed for the music feature
import hashlib
import os
import time
from datetime import datetime
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
from urllib.error import URLError, HTTPError
from pushover import init, Client # push notifications on your phone
# set the path of the music if u use the feature
# make sure to set up pushover on your phone before u use it
def check_for_update():
if(os.path.isfile("website.txt")):
req = Request(url)
try:
response = urlopen(req)
except HTTPError as e:
print('The server couldn\'t fulfill the request.')
print('Error code: ', e.code)
except URLError as e:
print('We failed to reach a server.')
print('Reason: ', e.reason)
html = urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
for script in soup(["script", "style"]):
script.extract()
text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines
for phrase in line.split(" "))
text = '\n'.join(chunk for chunk in chunks if chunk)
file = open("website_new.txt", "w")
file.write(text)
file.close()
hasher = hashlib.md5()
with open('website_new.txt', 'rb') as afile:
buf = afile.read()
hasher.update(buf)
global new_md5sum
new_md5sum = hasher.hexdigest()
else:
html = urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
for script in soup(["script", "style"]):
script.extract()
text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines
for phrase in line.split(" "))
text = '\n'.join(chunk for chunk in chunks if chunk)
file = open("website.txt", "w")
file.write(text)
file.close()
hasher = hashlib.md5()
with open('website.txt', 'rb') as afile:
buf = afile.read()
hasher.update(buf)
global original_md5sum
original_md5sum = hasher.hexdigest()
check_for_update()
def main():
global url
url = input("Paste the URL you want to check for updates: ")
global push
while True:
temp = input("\nDo you want to get a notification \
to your phone when the website has been changed? (y/n): ")
if (temp != "y" and temp != "n"):
print("Error: Please enter y or n")
else:
if temp == "y":
push = True
print("Notifications to your phone have been turned ON\n")
break
else:
print("Notifications to your phone have been turned OFF\n")
break
global music
while True:
temp = input("Do you want to play a song \
when the website has been changed? (y/n): ")
if (temp != "y" and temp != "n"):
print("Error: Please enter y or n")
else:
if temp == "y":
music = True
print("The music feature has been turned ON\n")
break
else:
print("The music feature has been turned OFF\n")
break
global update_timer
while True:
temp = input("How often do you want to check \
the website for updates? Enter it in seconds (min. 20): ")
if (temp.isdigit()):
temp = int(temp)
if temp > 19:
print("The website will be checked for \
updates every " + str(temp) + " seconds\n")
update_timer = temp
break
else:
print("Make sure to enter a value bigger than 19\n")
else:
print("Please enter an integer (which has to be bigger than 19)\n")
path = os.path.dirname(os.path.realpath(__file__))
try:
os.remove(path + "/website.txt")
except OSError:
pass
try:
os.remove(path + "/website_new.txt")
except OSError:
pass
original_md5sum = ""
new_md5sum = ""
check_for_update()
mainloop()
def mainloop():
while True:
check_for_update()
'''
print("Original: ", original_md5sum)
print("New: ", new_md5sum)
'''
if original_md5sum == new_md5sum:
print("Website hasn't been updated yet... " +
datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
else:
print("Website hat been updated! " +
datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
if push is True:
init("<token>")
Client("<client_id>\
").send_message("Website has been updated!", title="Website update")
if music is True:
# example: file:///home/anon/Music/song.mp3
p = vlc.MediaPlayer("file://<path>")
p.play()
time.sleep(60)
p.stop()
break
time.sleep(update_timer)
main()
1 Answer 1
First things first, I really like your idea and I think with a couple of small changes it would be a really nice utility to use! To start out with, here's a couple of suggestions regarding code style:
import
s are normally grouped:- standard library imports
- related third party imports
- local application / library specific imports
these categories should be separated by a single blank line:from datetime import datetime import hashlib import os import time from urllib.request import urlopen, Request from urllib.error import URLError, HTTPError from pushover import init, Client from bs4 import BeautifulSoup import vlc
Whitespace is good, if it's used sparingly. Excess use of whitespace can actually reduce readability, doing only harm to your code. I personally don't put a blank line in between a function signature and its first line, but that too is mostly up to you.
Those are the only real coding style issues I can find. Here's some other remarks about your code:
if
-statements don't have to be wrapped in parentheses:if(os.path.isfile("website.txt")):
... could simply become:
if os.path.isfile("website.txt"):
You can use a context manager for opening files, which is slightly more pythonic than manually calling
open()
andclose()
. The issue with dealing with files is it's essential toclose()
the file to prevent any data from being corrupted or lost. Thus, you need to wrap such a code block in atry
/except
-construct:try: f = open(<filename>, <mode>) # File operations here finally: f.close()
... if you forget to wrap it in such a statement, if an exception is raised while operating on a file, data may be damaged or lost. You can use context managers (the
with
statement) to automatically close the file when you release the resource:with open(<filename>, <mode>) as <variable>: # File operations follow # No need to call <variable>.close(), as soon as you leave the context, # close() is called.
urllib
is, frankly, a messy library. Kenneth Reiz has written an awesome wrapper around it, which automatically deals with underlying HTTP stuff, allowing you to focus on the actual scraping. Here's an example:import requests request = requests.get("http://httpbin.org") # requests automatically handles all the rest. assert request.status_code == 200 webpage = request.text # operate on request.text (...)
As you can see, it saves you a lot of work and time! The full documentation can be found here.
You should use
global
only when there is no other choice. Global variables make your code much harder to debug and pretty much impossible to test. For more reading up:Some of your variables are badly named. Take
buf
ortemp
. You may want to rename these tobuffer
andget_notification
.I suggest your
check_for_update()
function takes arguments likewebsite_path
andurl
instead of relying on user input. This way, you could import this code as a module and call<module>.check_for_update(website_path="website.txt", url=<url>)
. If you were to make this script into a module, you should probably putmain()
in a special kind ofif
-statement, called a name guard:if __name__ == "__main__": main()
There's a real difference between
==
(equals) andis
(is). Shortly said,a == b
checks if the value ofa
(or, more strictly put,a.__eq__()
) is equal to that ofb
, whereasa is b
checks ifa
is literally the same object (in memory) asb
(see this StackOverflow question for more details). So, when comparing to the value ofTrue
orFalse
, you need to use==
. But since Python automatically compares values in anif
statement toTrue
, you can leave all of that out and just do:if <expression>: # Functionally the same as `if <expression> == True:` if not <expression>: # Functionally the same as `if <expression> == False:`