Check website for update

Question 1

I have written a program in Python that checks websites for updates (it checks in anything on the website changes) and notifies the user. The idea is downloading the webpage, stripping HTML tags off it so extract the actual text and then calculating its md5 hash, so that we know when something changes. This is what my code looks right now (it's working on SOME websites). I would appreciate feedback about my style / logic / possible features that could be interesting etc. I also would like to know how to find out on which websites my code works and on which websites it doesn't work and why (for example if there are dynamic numbers that change it will mess up the MD5 hash).

import vlc # needed for the music feature
import hashlib
import os
import time
from datetime import datetime
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
from urllib.error import URLError, HTTPError
from pushover import init, Client # push notifications on your phone
# set the path of the music if u use the feature
# make sure to set up pushover on your phone before u use it
def check_for_update():
 if(os.path.isfile("website.txt")):
 req = Request(url)
 try:
 response = urlopen(req)
 except HTTPError as e:
 print('The server couldn\'t fulfill the request.')
 print('Error code: ', e.code)
 except URLError as e:
 print('We failed to reach a server.')
 print('Reason: ', e.reason)
 html = urlopen(url).read()
 soup = BeautifulSoup(html, "lxml")
 for script in soup(["script", "style"]):
 script.extract()
 text = soup.get_text()
 lines = (line.strip() for line in text.splitlines())
 chunks = (phrase.strip() for line in lines
 for phrase in line.split(" "))
 text = '\n'.join(chunk for chunk in chunks if chunk)
 file = open("website_new.txt", "w")
 file.write(text)
 file.close()
 hasher = hashlib.md5()
 with open('website_new.txt', 'rb') as afile:
 buf = afile.read()
 hasher.update(buf)
 global new_md5sum
 new_md5sum = hasher.hexdigest()
 else:
 html = urlopen(url).read()
 soup = BeautifulSoup(html, "lxml")
 for script in soup(["script", "style"]):
 script.extract()
 text = soup.get_text()
 lines = (line.strip() for line in text.splitlines())
 chunks = (phrase.strip() for line in lines
 for phrase in line.split(" "))
 text = '\n'.join(chunk for chunk in chunks if chunk)
 file = open("website.txt", "w")
 file.write(text)
 file.close()
 hasher = hashlib.md5()
 with open('website.txt', 'rb') as afile:
 buf = afile.read()
 hasher.update(buf)
 global original_md5sum
 original_md5sum = hasher.hexdigest()
 check_for_update()
def main():
 global url
 url = input("Paste the URL you want to check for updates: ")
 global push
 while True:
 temp = input("\nDo you want to get a notification \
to your phone when the website has been changed? (y/n): ")
 if (temp != "y" and temp != "n"):
 print("Error: Please enter y or n")
 else:
 if temp == "y":
 push = True
 print("Notifications to your phone have been turned ON\n")
 break
 else:
 print("Notifications to your phone have been turned OFF\n")
 break
 global music
 while True:
 temp = input("Do you want to play a song \
when the website has been changed? (y/n): ")
 if (temp != "y" and temp != "n"):
 print("Error: Please enter y or n")
 else:
 if temp == "y":
 music = True
 print("The music feature has been turned ON\n")
 break
 else:
 print("The music feature has been turned OFF\n")
 break
 global update_timer
 while True:
 temp = input("How often do you want to check \
the website for updates? Enter it in seconds (min. 20): ")
 if (temp.isdigit()):
 temp = int(temp)
 if temp > 19:
 print("The website will be checked for \
updates every " + str(temp) + " seconds\n")
 update_timer = temp
 break
 else:
 print("Make sure to enter a value bigger than 19\n")
 else:
 print("Please enter an integer (which has to be bigger than 19)\n")
 path = os.path.dirname(os.path.realpath(__file__))
 try:
 os.remove(path + "/website.txt")
 except OSError:
 pass
 try:
 os.remove(path + "/website_new.txt")
 except OSError:
 pass
 original_md5sum = ""
 new_md5sum = ""
 check_for_update()
 mainloop()
def mainloop():
 while True:
 check_for_update()
 '''
 print("Original: ", original_md5sum)
 print("New: ", new_md5sum)
 '''
 if original_md5sum == new_md5sum:
 print("Website hasn't been updated yet... " +
 datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
 else:
 print("Website hat been updated! " +
 datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
 if push is True:
 init("<token>")
 Client("<client_id>\
").send_message("Website has been updated!", title="Website update")
 if music is True:
 # example: file:///home/anon/Music/song.mp3
 p = vlc.MediaPlayer("file://<path>")
 p.play()
 time.sleep(60)
 p.stop()
 break
 time.sleep(update_timer)
main()

Question 2

First things first, I really like your idea and I think with a couple of small changes it would be a really nice utility to use! To start out with, here's a couple of suggestions regarding code style:

imports are normally grouped:

standard library imports
related third party imports
local application / library specific imports

these categories should be separated by a single blank line:

from datetime import datetime
import hashlib
import os
import time
from urllib.request import urlopen, Request
from urllib.error import URLError, HTTPError
from pushover import init, Client
from bs4 import BeautifulSoup
import vlc

Whitespace is good, if it's used sparingly. Excess use of whitespace can actually reduce readability, doing only harm to your code. I personally don't put a blank line in between a function signature and its first line, but that too is mostly up to you.

Those are the only real coding style issues I can find. Here's some other remarks about your code:

if-statements don't have to be wrapped in parentheses:
```
if(os.path.isfile("website.txt")):
```
... could simply become:
```
if os.path.isfile("website.txt"):
```
You can use a context manager for opening files, which is slightly more pythonic than manually calling open() and close(). The issue with dealing with files is it's essential to close() the file to prevent any data from being corrupted or lost. Thus, you need to wrap such a code block in a try / except-construct:
```
try:
 f = open(<filename>, <mode>)
 # File operations here
finally:
 f.close()
```
... if you forget to wrap it in such a statement, if an exception is raised while operating on a file, data may be damaged or lost. You can use context managers (the with statement) to automatically close the file when you release the resource:
```
with open(<filename>, <mode>) as <variable>:
 # File operations follow
# No need to call <variable>.close(), as soon as you leave the context,
# close() is called.
```
urllib is, frankly, a messy library. Kenneth Reiz has written an awesome wrapper around it, which automatically deals with underlying HTTP stuff, allowing you to focus on the actual scraping. Here's an example:
```
import requests
request = requests.get("http://httpbin.org")
# requests automatically handles all the rest.
assert request.status_code == 200
webpage = request.text
# operate on request.text (...)
```
As you can see, it saves you a lot of work and time! The full documentation can be found here.
You should use global only when there is no other choice. Global variables make your code much harder to debug and pretty much impossible to test. For more reading up:
- Why are global variables evil?
- Python anti-patterns - global
Some of your variables are badly named. Take buf or temp. You may want to rename these to buffer and get_notification.
I suggest your check_for_update() function takes arguments like website_path and url instead of relying on user input. This way, you could import this code as a module and call <module>.check_for_update(website_path="website.txt", url=<url>). If you were to make this script into a module, you should probably put main() in a special kind of if-statement, called a name guard:
```
if __name__ == "__main__":
 main()
```
What does if __name__ == "__main__" do?
There's a real difference between == (equals) and is (is). Shortly said, a == b checks if the value of a (or, more strictly put, a.__eq__()) is equal to that of b, whereas a is b checks if a is literally the same object (in memory) as b (see this StackOverflow question for more details). So, when comparing to the value of True or False, you need to use ==. But since Python automatically compares values in an if statement to True, you can leave all of that out and just do:
```
if <expression>:
 # Functionally the same as `if <expression> == True:`
if not <expression>:
 # Functionally the same as `if <expression> == False:`
```

Daniel Daniel 4,6222 gold badges18 silver badges40 bronze badges · Accepted Answer · 2017-08-13 18:09:43Z

First things first, I really like your idea and I think with a couple of small changes it would be a really nice utility to use! To start out with, here's a couple of suggestions regarding code style:

imports are normally grouped:

standard library imports
related third party imports
local application / library specific imports

these categories should be separated by a single blank line:

from datetime import datetime
import hashlib
import os
import time
from urllib.request import urlopen, Request
from urllib.error import URLError, HTTPError
from pushover import init, Client
from bs4 import BeautifulSoup
import vlc

Whitespace is good, if it's used sparingly. Excess use of whitespace can actually reduce readability, doing only harm to your code. I personally don't put a blank line in between a function signature and its first line, but that too is mostly up to you.

Those are the only real coding style issues I can find. Here's some other remarks about your code:

if-statements don't have to be wrapped in parentheses:
```
if(os.path.isfile("website.txt")):
```
... could simply become:
```
if os.path.isfile("website.txt"):
```
You can use a context manager for opening files, which is slightly more pythonic than manually calling open() and close(). The issue with dealing with files is it's essential to close() the file to prevent any data from being corrupted or lost. Thus, you need to wrap such a code block in a try / except-construct:
```
try:
 f = open(<filename>, <mode>)
 # File operations here
finally:
 f.close()
```
... if you forget to wrap it in such a statement, if an exception is raised while operating on a file, data may be damaged or lost. You can use context managers (the with statement) to automatically close the file when you release the resource:
```
with open(<filename>, <mode>) as <variable>:
 # File operations follow
# No need to call <variable>.close(), as soon as you leave the context,
# close() is called.
```
urllib is, frankly, a messy library. Kenneth Reiz has written an awesome wrapper around it, which automatically deals with underlying HTTP stuff, allowing you to focus on the actual scraping. Here's an example:
```
import requests
request = requests.get("http://httpbin.org")
# requests automatically handles all the rest.
assert request.status_code == 200
webpage = request.text
# operate on request.text (...)
```
As you can see, it saves you a lot of work and time! The full documentation can be found here.
You should use global only when there is no other choice. Global variables make your code much harder to debug and pretty much impossible to test. For more reading up:
- Why are global variables evil?
- Python anti-patterns - global
Some of your variables are badly named. Take buf or temp. You may want to rename these to buffer and get_notification.
I suggest your check_for_update() function takes arguments like website_path and url instead of relying on user input. This way, you could import this code as a module and call <module>.check_for_update(website_path="website.txt", url=<url>). If you were to make this script into a module, you should probably put main() in a special kind of if-statement, called a name guard:
```
if __name__ == "__main__":
 main()
```
What does if __name__ == "__main__" do?
There's a real difference between == (equals) and is (is). Shortly said, a == b checks if the value of a (or, more strictly put, a.__eq__()) is equal to that of b, whereas a is b checks if a is literally the same object (in memory) as b (see this StackOverflow question for more details). So, when comparing to the value of True or False, you need to use ==. But since Python automatically compares values in an if statement to True, you can leave all of that out and just do:
```
if <expression>:
 # Functionally the same as `if <expression> == True:`
if not <expression>:
 # Functionally the same as `if <expression> == False:`
```

Stack Exchange Network

Check website for update

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Check website for update

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions