Simple python script to shorten URLs in specified text files

Question 1

A recent question inspired me to write a script for this task:

Shorten URLs in specified text files using googleapis.com
Handle multiple URLs, whether on the same line or multiple lines

The original question tried to do this in Bash, but it gets quite tricky when it comes to handling multiple URLs, so I decided to rewrite in Python:

#!/usr/bin/env python
import sys
import os
import re
import urllib2
import json
re_url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
def shorten_url(url):
 print("Shortening url {} ...".format(url))
 post_url = 'https://www.googleapis.com/urlshortener/v1/url?fields=id'
 postdata = {'longUrl': url}
 headers = {'Content-Type': 'application/json'}
 req = urllib2.Request(post_url, json.dumps(postdata), headers)
 return json.loads(urllib2.urlopen(req).read())['id']
def shorten_urls_in_text(text):
 found_urls = False
 for url in set(re_url_pattern.findall(text)):
 short_url = shorten_url(url)
 text = text.replace(url, short_url)
 found_urls = True
 return found_urls, text
def shorten_urls_in_file(path):
 text = read_file(path)
 found_urls, text = shorten_urls_in_text(text)
 if found_urls:
 write_file(path, text)
def shorten_urls_in_files(*paths):
 for path in paths:
 if os.path.isfile(path):
 print('Shortening urls in file {} ...'.format(path))
 shorten_urls_in_file(path)
 else:
 print('No such file: {}'.format(path))
def read_file(path):
 with open(path) as fh:
 return fh.read()
def write_file(path, text):
 with open(path, 'w') as fh:
 fh.write(text)
if __name__ == '__main__':
 shorten_urls_in_files(*sys.argv[1:])

And some unit tests:

#!/usr/bin/env python
import unittest
from test.test_support import run_unittest
from shorten_urls import shorten_url
from shorten_urls import shorten_urls_in_text
sample_url1 = 'http://news.ycombinator.com'
sample_url1_shortened = 'http://goo.gl/ovZB'
class UrlShortenerTestCase(unittest.TestCase):
 def test_shorten_url(self):
 self.assertEqual(sample_url1_shortened, shorten_url(sample_url1))
 def test_replace_nothing(self):
 orig = 'hello'
 found_urls, text = shorten_urls_in_text(orig)
 self.assertFalse(found_urls)
 self.assertEqual(orig, text)
 def assert_correct_replacement(self, fmt):
 orig = fmt.format(url=sample_url1)
 expected = fmt.format(url=sample_url1_shortened)
 found_urls, text = shorten_urls_in_text(orig)
 self.assertTrue(found_urls)
 self.assertEqual(expected, text)
 def test_replace_one(self):
 fmt = '''hello world {url}'''
 self.assert_correct_replacement(fmt)
 def test_replace_many(self):
 fmt = '''hello world {url}\nnewline {url} again\n{url} last'''
 self.assert_correct_replacement(fmt)
 def test_replace_many_on_same_line(self):
 fmt = '''hello world {url} same line {url} again'''
 self.assert_correct_replacement(fmt)
def test_main():
 run_unittest(UrlShortenerTestCase)
if __name__ == '__main__':
 test_main()

It's not great that it needs to read the entire file in memory, but for handling relatively small files I thought it should be good enough.

I would like a review in terms of Python coding practices, style, and possible over-engineering.

I'm also wondering if urllib2 is considered more widely available than requests. I would prefer requests as it has simpler syntax, but I used urllib2 to make it more widely usable (if that's even true).

Question 2

A few comments:

Imports should be in alphabetical order
Please use docstrings (PEP257)
The sample_url1 and sample_url1_shortened variables should be class attributes in UrlShortenerTestCase
I'd prefer to use double quotes in the fmt variable in the test methods
Maybe a test case in which the google service isn't available should be added

In the shorten_urls_in_text function, there's no need to use the found_urls variable, the code can rely just in the set that is created for the loop:

def shorten_urls_in_text(text):
 """Shorten URLs in given text.
 :param text: A text that might contain long URLs
 :type text: str
 :return: If URLs were founda and the text with the URLs shortened
 :rtype: tuple(bool, str)
 """
 urls = set(re_url_pattern.findall(text))
 for url in urls:
 short_url = shorten_url(url)
 text = text.replace(url, short_url)
 return bool(urls), text

The unit tests shouldn't need the google service to be available. The urllib2 module should be patched as described here:

import json
from mock import (
 Mock,
 patch,
)
class UrlShortenerTestCase(unittest.TestCase):
 """Test URL shortening functionality."""
 def setUp(self):
 """Patch urllib2 with expected response from google."""
 self.patcher = patch('shorten_urls.urllib2')
 urllib2 = self.patcher.start()
 urllib2.urlopen.return_value.read.return_value = (
 json.dumps({'id': sample_url1_shortened}))
 def tearDown(self):
 """Undo the patching in the setup."""
 self.patcher.stop()

Question 3

Overall, LGTM.

Suggestions:

try ... except would look better than os.path.isfile(). Even if isfile returns True, you still may encounter an exception trying to open it (say, insufficient privileges), so you should be ready to handle exceptions anyway. Besides, using isfile may lead to a race (not likely in this code), if a file is removed right between testing and opening it.
The script spends most of the time in shorten_url, which in turn mostly waits for network IO. I'd recommend to give each invocation a thread.
Original files are destroyed. I'd prefer to keep originals intact, or at least provide an option for that. Also an ability to read urls from standard input is usually desirable.

jcollado jcolladojcollado 1,3987 silver badges8 bronze badges · Accepted Answer · 2014-07-25 08:26:32Z

A few comments:

Imports should be in alphabetical order
Please use docstrings (PEP257)
The sample_url1 and sample_url1_shortened variables should be class attributes in UrlShortenerTestCase
I'd prefer to use double quotes in the fmt variable in the test methods
Maybe a test case in which the google service isn't available should be added

In the shorten_urls_in_text function, there's no need to use the found_urls variable, the code can rely just in the set that is created for the loop:

def shorten_urls_in_text(text):
 """Shorten URLs in given text.
 :param text: A text that might contain long URLs
 :type text: str
 :return: If URLs were founda and the text with the URLs shortened
 :rtype: tuple(bool, str)
 """
 urls = set(re_url_pattern.findall(text))
 for url in urls:
 short_url = shorten_url(url)
 text = text.replace(url, short_url)
 return bool(urls), text

The unit tests shouldn't need the google service to be available. The urllib2 module should be patched as described here:

import json
from mock import (
 Mock,
 patch,
)
class UrlShortenerTestCase(unittest.TestCase):
 """Test URL shortening functionality."""
 def setUp(self):
 """Patch urllib2 with expected response from google."""
 self.patcher = patch('shorten_urls.urllib2')
 urllib2 = self.patcher.start()
 urllib2.urlopen.return_value.read.return_value = (
 json.dumps({'id': sample_url1_shortened}))
 def tearDown(self):
 """Undo the patching in the setup."""
 self.patcher.stop()

Stack Exchange Network

Simple python script to shorten URLs in specified text files

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Simple python script to shorten URLs in specified text files

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions