I am building a Python app in which user can maintain wishlist of the products. I only support few e-commerce sites and do not support country specific sites (e.g I may support amazon.com but not amazon.in). I do not support mobile version of the URL (e.g. http://m.amazon.com) and also I am not interested in the query string part of the URL (and I don't want it also).
Following is the code and also the test cases. Though it seems to be working, I am not happy with the code. It looks hackish to me. I would really appreciate improving get_vendor
function. Do you find it readable?
from urlparse import urlparse
supported_vendors = ['flipkart.com', 'homeshop18.com', 'snapdeal.com',
'myntra.com', 'www.flipkart.com', 'www.homeshop18.com',
'www.snapdeal.com', 'www.myntra.com']
test_urls =['http://www.myntra.com/sports-shoes/puma/puma-men-grey-kuris-ii-ind-running-shoes/107455/buy?searchQuery=sports-shoes&serp=1&uq=false#!',
'www.myntra.com/sports-shoes/puma/puma-men-grey-kuris-ii-ind-running-shoes/107455/buy?searchQuery=sports-shoes&serp=1&uq=false#!',
'https://myntra.com/sports-shoes/puma/puma-men-grey-kuris-ii-ind-running-shoes/107455/buy?searchQuery=sports-shoes&serp=1&uq=false#!',
'http://.myntra.com/sports-shoes/puma/puma-men-grey-kuris-ii-ind-running-shoes/107455/buy?searchQuery=sports-shoes&serp=1&uq=false#!',
'htt://www.myntra.com/sports-shoes/puma/puma-men-grey-kuris-ii-ind-running-shoes/107455/buy?searchQuery=sports-shoes&serp=1&uq=false#!',
'http://blahblah.com/sports-shoes/puma/puma-men-grey-kuris-ii-ind-running-shoes/107455/buy?searchQuery=sports-shoes&serp=1&uq=false#!',
'ftp://myntra.com/sports-shoes/puma/puma-men-grey-kuris-ii-ind-running-shoes/107455/buy?searchQuery=sports-shoes&serp=1&uq=false#!']
test_expected_results = [True, True, True, False, False, False, False]
def get_vendor(url):
def add_http(url):
return 'http://'+url
def get_url(parsed_url):
return 'http://'+parsed_url.netloc+parsed_url.path
if not urlparse(url).scheme:
parsed_url = urlparse(add_http(url))
elif urlparse(url).scheme not in ['http', 'https']:
return (None, url)
else:
parsed_url = urlparse(url)
if parsed_url.netloc not in supported_vendors:
return (None, url)
else:
return (parsed_url.netloc, get_url(parsed_url))
if __name__ == '__main__':
# for i in map(get_vendor, test_urls):
# print i
assert(map(lambda x: bool(x[0]), map(get_vendor, test_urls)) == test_expected_results)
based on what vendor returns I apply different scraping functions. Here is the next code:
HS18 = ['homeshop18.com', 'www.homeshop18.com']
SD = ['snapdeal.com', 'www.snapdeal.com']
FS = ['flipkart.com', 'www.flipkart.com']
MYN = ['myntra.com', 'www.myntra.com']
vendor, url = get_vendor(url)
if not vendor:
exit # or throw error
product_doc = search_for_url_in_database(url)
if not product_doc:
if vendor in FS:
product_url, product_name, product_img_url, product_price = get_flipkart_product_meta(url)
elif vendor in HS18:
product_url, product_name, product_img_url, product_price = get_homeshop18_product_meta(url)
elif vendor in SD:
product_url, product_name, product_img_url, product_price = get_snapdeal_product_meta(url)
else:
product_url, product_name, product_img_url, product_price = get_myntra_product_meta(url)
...
This last if block I can change it to:
if not product_doc:
if vendor in FS:
get_product_meta = get_flipkart_product_meta
elif vendor in HS18:
get_product_meta = get_homeshop18_product_meta
elif vendor in SD:
get_product_meta = get_snapdeal_product_meta
else:
get_product_meta = get_myntra_product_meta
product_url, product_name, product_img_url, product_price = get_product_meta(url)
...
What do you say about this second
if
block where I am assigning function object to a variable and calling it later. Should I be doing like this? What's the suggested way?The lists defined at top, has redundant data. Is there any way to make it better?
I really would like to keep leading
www
if the url has (even though avoiding this do not harm). So how do I do this with cleaner way? or you suggest to removewww
completely?
Generally how do you guys handle situation like this, when you have to call a function based on if-else blocks?
EDIT: Following code was suggested by someone on reddit. Looks clean to me:
PRODUCT_METAS = {
'homeshop18.com': get_homeshop18_product_meta,
'snapdeal.com': get_snapdeal_product_meta,
'flipkart.com': get_flipkart_product_meta,
' myntra.com': get_myntra_product_meta,
}
vendor, url = get_vendor(url)
if vendor is None:
exit()
product_doc = search_for_url_in_database(url)
if not product_doc:
if vendor.startswith('www.'):
vendor = vendor[len('www.'):]
try:
product_url, product_name, product_img_url, product_price = PRODUCT_METAS[vendor](url)
except KeyError:
exit
1 Answer 1
The two helper functions feel redundant to me; 'http://'+url
is clear enough as is, and there is an urlunparse
in urlparse
module.
My proposal:
def get_vendor(url):
parsed_url = urlparse(url)
if not parsed_url.scheme:
parsed_url = urlparse('http://'+url)
scheme, netloc, path, params, query, fragment = parsed_url
if scheme in ['http', 'https'] and netloc in supported_vendors:
return (netloc, urlunparse((scheme, netloc, path, '','','')))
else:
return (None, url)
To deal with an optional www.
in front, you could avoid complicating the get_vendor
function by adding it programmatically to each known URL. Now, if you needed to add a vendor that's only available with or without www.
, you would only have to change this back how it was and add to that list.
supported_vendors = ['flipkart.com', 'homeshop18.com', 'snapdeal.com',
'myntra.com']
supported_vendors += ['www.' + x for x in supported_vendors]
-
\$\begingroup\$ A really much cleaner code! I did not know about urlunparse that's why I was adding HTTP and another helper function. Now with your code, it actually retains the url scheme. Thank you very much! \$\endgroup\$avi– avi2014年02月02日 11:50:58 +00:00Commented Feb 2, 2014 at 11:50
-
\$\begingroup\$ Can I do anything about leading
www
? because of this I have to maintain the list with redundant data (i.e. with and withoutwww
). I just checked and these e-commerce sites do support without leadingwww
. But question should I be removing it? Because once theget_vendor
function returnsvendor
andurl
, based on that I have to do different actions depending on thevendor
. I updated the OP, would appreciate if you can have a look again. \$\endgroup\$avi– avi2014年02月02日 14:02:51 +00:00Commented Feb 2, 2014 at 14:02
get_url
always uses http. Is that right? \$\endgroup\$get_url
I have to check whetherscheme
ishttp
orhttps
, based on that I have to append HTTP or HTTPS. Just to avoid anif
block, which I thought unnecessary since these sites do not complain, I thought of going with HTTP. And thank you for going over my code. \$\endgroup\$