A recent question inspired me to write a script for this task:
- Shorten URLs in specified text files using googleapis.com
- Handle multiple URLs, whether on the same line or multiple lines
The original question tried to do this in Bash, but it gets quite tricky when it comes to handling multiple URLs, so I decided to rewrite in Python:
#!/usr/bin/env python
import sys
import os
import re
import urllib2
import json
re_url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
def shorten_url(url):
print("Shortening url {} ...".format(url))
post_url = 'https://www.googleapis.com/urlshortener/v1/url?fields=id'
postdata = {'longUrl': url}
headers = {'Content-Type': 'application/json'}
req = urllib2.Request(post_url, json.dumps(postdata), headers)
return json.loads(urllib2.urlopen(req).read())['id']
def shorten_urls_in_text(text):
found_urls = False
for url in set(re_url_pattern.findall(text)):
short_url = shorten_url(url)
text = text.replace(url, short_url)
found_urls = True
return found_urls, text
def shorten_urls_in_file(path):
text = read_file(path)
found_urls, text = shorten_urls_in_text(text)
if found_urls:
write_file(path, text)
def shorten_urls_in_files(*paths):
for path in paths:
if os.path.isfile(path):
print('Shortening urls in file {} ...'.format(path))
shorten_urls_in_file(path)
else:
print('No such file: {}'.format(path))
def read_file(path):
with open(path) as fh:
return fh.read()
def write_file(path, text):
with open(path, 'w') as fh:
fh.write(text)
if __name__ == '__main__':
shorten_urls_in_files(*sys.argv[1:])
And some unit tests:
#!/usr/bin/env python
import unittest
from test.test_support import run_unittest
from shorten_urls import shorten_url
from shorten_urls import shorten_urls_in_text
sample_url1 = 'http://news.ycombinator.com'
sample_url1_shortened = 'http://goo.gl/ovZB'
class UrlShortenerTestCase(unittest.TestCase):
def test_shorten_url(self):
self.assertEqual(sample_url1_shortened, shorten_url(sample_url1))
def test_replace_nothing(self):
orig = 'hello'
found_urls, text = shorten_urls_in_text(orig)
self.assertFalse(found_urls)
self.assertEqual(orig, text)
def assert_correct_replacement(self, fmt):
orig = fmt.format(url=sample_url1)
expected = fmt.format(url=sample_url1_shortened)
found_urls, text = shorten_urls_in_text(orig)
self.assertTrue(found_urls)
self.assertEqual(expected, text)
def test_replace_one(self):
fmt = '''hello world {url}'''
self.assert_correct_replacement(fmt)
def test_replace_many(self):
fmt = '''hello world {url}\nnewline {url} again\n{url} last'''
self.assert_correct_replacement(fmt)
def test_replace_many_on_same_line(self):
fmt = '''hello world {url} same line {url} again'''
self.assert_correct_replacement(fmt)
def test_main():
run_unittest(UrlShortenerTestCase)
if __name__ == '__main__':
test_main()
It's not great that it needs to read the entire file in memory, but for handling relatively small files I thought it should be good enough.
I would like a review in terms of Python coding practices, style, and possible over-engineering.
I'm also wondering if urllib2
is considered more widely available than requests
. I would prefer requests
as it has simpler syntax, but I used urllib2
to make it more widely usable (if that's even true).
2 Answers 2
A few comments:
- Imports should be in alphabetical order
- Please use docstrings (PEP257)
- The
sample_url1
andsample_url1_shortened
variables should be class attributes inUrlShortenerTestCase
- I'd prefer to use double quotes in the
fmt
variable in the test methods - Maybe a test case in which the google service isn't available should be added
In the
shorten_urls_in_text
function, there's no need to use thefound_urls
variable, the code can rely just in the set that is created for the loop:def shorten_urls_in_text(text): """Shorten URLs in given text. :param text: A text that might contain long URLs :type text: str :return: If URLs were founda and the text with the URLs shortened :rtype: tuple(bool, str) """ urls = set(re_url_pattern.findall(text)) for url in urls: short_url = shorten_url(url) text = text.replace(url, short_url) return bool(urls), text
The unit tests shouldn't need the google service to be available. The
urllib2
module should be patched as described here:import json from mock import ( Mock, patch, ) class UrlShortenerTestCase(unittest.TestCase): """Test URL shortening functionality.""" def setUp(self): """Patch urllib2 with expected response from google.""" self.patcher = patch('shorten_urls.urllib2') urllib2 = self.patcher.start() urllib2.urlopen.return_value.read.return_value = ( json.dumps({'id': sample_url1_shortened})) def tearDown(self): """Undo the patching in the setup.""" self.patcher.stop()
Overall, LGTM.
Suggestions:
try ... except
would look better thanos.path.isfile()
. Even ifisfile
returnsTrue
, you still may encounter an exception trying to open it (say, insufficient privileges), so you should be ready to handle exceptions anyway. Besides, usingisfile
may lead to a race (not likely in this code), if a file is removed right between testing and opening it.The script spends most of the time in
shorten_url
, which in turn mostly waits for network IO. I'd recommend to give each invocation a thread.Original files are destroyed. I'd prefer to keep originals intact, or at least provide an option for that. Also an ability to read urls from standard input is usually desirable.