Recently, I wanted to change my vim colors to something new. So I went to the vim colors
website and then I decided that I wanted to download ALL the colors.
So I wrote this simple Python script that will download all the files from: Vim Colors
I'm still learning Python and it would be amazing to get some feedback / suggestions if possible.
import os
import json
import requests
from bs4 import BeautifulSoup
class Spider:
def __init__(self, total_pages=40):
self.base_url = "http://vimcolors.com/?page="
self.total_pages = total_pages
self.download_dir = 'colors/'
# If we don't have the download directory
if not os.path.isdir(self.download_dir):
print(self.download_dir, 'does not exist, trying to create it...')
# create it...
os.mkdir(self.download_dir)
def download(self, name, url):
# If we have already downloaded this file, just skip
if os.path.isfile(self.download_dir + name):
print('File:', name, 'already exists; skipping.')
return
try:
# Get the response
response = requests.get(url)
# If response is 404 (Not Found), just exit
if response.status_code == 404:
raise Exception('File not found')
# Create the file
with open(self.download_dir + name, 'wb') as file_path:
# Write content to the file
file_path.write(response.content)
# Confirm the download
print('Downloaded', name)
except:
# This is a very generic error, perhaps I'll change it sometime :)
print('Could not download the file', name)
pass
def crawl(self):
def repo_formatter(scheme):
return scheme['github_repo']['address'].replace('github.com', 'raw.githubusercontent.com') \
+ '/master/colors/' + scheme['name'] + '.vim'
# Loop over all the pages
for page in range(self.total_pages):
page_source = requests.get(self.base_url + str(page + 1))
plain_text = page_source.text
soup = BeautifulSoup(plain_text, 'lxml')
# Get the data
json_data = json.loads(soup.find('div', {'id': 'data'}).attrs['data-colorschemes'])
# Download the files
for data in json_data['colorschemes']:
self.download(data['name'] + '.vim', repo_formatter(data))
colors_spider = Spider()
colors_spider.crawl()
-
\$\begingroup\$ Welcome to code review, I hope you get some good answers. \$\endgroup\$pacmaninbw– pacmaninbw ♦2016年08月30日 14:12:37 +00:00Commented Aug 30, 2016 at 14:12
-
\$\begingroup\$ Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers . \$\endgroup\$Simon Forsberg– Simon Forsberg2016年08月31日 12:37:33 +00:00Commented Aug 31, 2016 at 12:37
-
\$\begingroup\$ I left the original in there. I did not delete the one I had before, but that's fine. :) thanks! \$\endgroup\$Kernel.Panic– Kernel.Panic2016年08月31日 13:19:38 +00:00Commented Aug 31, 2016 at 13:19
2 Answers 2
Very nice code in general! Here are a few points, though:
Think about using os.makedirs
instead of os.mkdir
. The former allows creating nested directories (so os.makedirs("~/long/new/path")
will create all other folders in betweetn as well, if needed).
You should never have a bare except
. This prevents you to e.g. CTRL-C to abort the program, since it will also be caught. Use at least except Exception
, which all normal exceptions inherit from (but not some special exceptions like pressing CTRL-C).
In general it is usually not the best idea to have a lot of string addition. You could use format
instead in repo_formatter
:
def repo_formatter(scheme):
base_url = scheme['github_repo']['address'].replace('github.com', 'raw.githubusercontent.com')
return '{}/master/colors/{}.vim'.format(base_url, scheme['name'])
You have some very obvious comments in your code, especially in download
. Comments should explain why you do something and not what. The latter should be obvious from your code! And in that function it is really easy to follow what happens, because you have chosen appropriate variable names :)
-
1\$\begingroup\$ @Kernel.Panic Believe it or not, at some point you'll be able to read simple code faster than the comments explaining them. Thats why they can actually hinder readability to an experienced coder. \$\endgroup\$jpaugh– jpaugh2016年08月31日 01:09:13 +00:00Commented Aug 31, 2016 at 1:09
-
\$\begingroup\$ Graipher, Did you mean "string concatenation" instead of "string addition"? \$\endgroup\$jpaugh– jpaugh2016年08月31日 01:10:06 +00:00Commented Aug 31, 2016 at 1:10
-
\$\begingroup\$ @jpaugh Well yes and no. It is more like addition in python, because they are immutable. You don't add the second string to the end, you generate a new string which has both contents of the two strings you are adding... \$\endgroup\$Graipher– Graipher2016年08月31日 07:47:10 +00:00Commented Aug 31, 2016 at 7:47
-
1\$\begingroup\$ Ah, that makes sense. I tend to think of concatenation as being a pure operation as well. \$\endgroup\$jpaugh– jpaugh2016年08月31日 17:46:14 +00:00Commented Aug 31, 2016 at 17:46
Nice script! A few points on top of what @Graipher already pointed out.
Passing url parameters to requests
This is a hacky way to pass query parameters:
url = "http://vimcolors.com/?page=" # ... page_source = requests.get(url + str(page + 1))
The proper way to do this is to pass the parameters in a dictionary:
url = "http://vimcolors.com"
# ...
page_source = requests.get(url, {'page': page + 1})
Building path strings
This is a hacky, non-portable way to build and work with path strings:
download_dir = 'colors/' # ... if os.path.isfile(download_dir + name): # ...
The proper way to do this is using os.path.join
:
download_dir = 'colors'
# ...
if os.path.isfile(os.path.join(download_dir, name)):
# ...
Don't repeat yourself
You wrote self.download_dir + name
multiple times within the same scope.
Don't repeat yourself, put it into a variable and reuse it.
Exception handling
There are several problems with this block of code:
try: # Get the response response = requests.get(url) # If response is 404 (Not Found), just exit if response.status_code == 404: raise Exception('File not found') # Create the file with open(self.download_dir + name, 'wb') as file_path: # Write content to the file file_path.write(response.content) # Confirm the download print('Downloaded', name) except: # This is a very generic error, perhaps I'll change it sometime :) print('Could not download the file', name) pass
@Graipher already mentioned to not use generic exceptions. The coding style guide has many things to say about exceptions, I suggest to read it well.
The message you print in except
may be misleading.
As the try
block includes not only the downloading,
but also the writing to the file,
if the latter raises an exception,
it will be caught, and the message printed will be misleading,
which can be extremely frustrating when trying to debug the problem.
At the minimum,
the try
block should be split.
Finally,
as this block of code is at the end of the function,
the pass
is pointless.
-
\$\begingroup\$ Didn't know those non-hacky / proper methods! Thank you, sir! :) I will definitely update my code accordingly later on. \$\endgroup\$Kernel.Panic– Kernel.Panic2016年08月30日 16:55:01 +00:00Commented Aug 30, 2016 at 16:55
-
\$\begingroup\$ I was actually trying to catch a message from exception. I guess Python doesn't have anything like that? For example, in
PHP
I would do$e->getMessage()
to fetch the proper message from exception. I couldn't find a similar solution in Python.. \$\endgroup\$Kernel.Panic– Kernel.Panic2016年08月30日 16:56:36 +00:00Commented Aug 30, 2016 at 16:56 -
1\$\begingroup\$ @Kernel.Panic stackoverflow.com/documentation/python/1788/… and stackoverflow.com/questions/1483429/… c: \$\endgroup\$Rod– Rod2016年08月30日 19:01:01 +00:00Commented Aug 30, 2016 at 19:01
Explore related questions
See similar questions with these tags.