
I am building a Python app in which users can maintain a wishlist of products. I only support a few e-commerce sites, and I do not support country-specific sites (e.g. I may support amazon.com but not amazon.in). I do not support mobile versions of the URLs (e.g. http://m.amazon.com), and I am not interested in the query string part of the URL (I don't want to keep it either).

Following is the code, along with the test cases. Though it seems to work, I am not happy with it; it looks hackish to me. I would really appreciate help improving the get_vendor function. Do you find it readable?

from urlparse import urlparse

supported_vendors = ['flipkart.com', 'homeshop18.com', 'snapdeal.com',
                     'myntra.com', 'www.flipkart.com', 'www.homeshop18.com',
                     'www.snapdeal.com', 'www.myntra.com']

test_urls = ['http://www.myntra.com/sports-shoes/puma/puma-men-grey-kuris-ii-ind-running-shoes/107455/buy?searchQuery=sports-shoes&serp=1&uq=false#!',
             'www.myntra.com/sports-shoes/puma/puma-men-grey-kuris-ii-ind-running-shoes/107455/buy?searchQuery=sports-shoes&serp=1&uq=false#!',
             'https://myntra.com/sports-shoes/puma/puma-men-grey-kuris-ii-ind-running-shoes/107455/buy?searchQuery=sports-shoes&serp=1&uq=false#!',
             'http://.myntra.com/sports-shoes/puma/puma-men-grey-kuris-ii-ind-running-shoes/107455/buy?searchQuery=sports-shoes&serp=1&uq=false#!',
             'htt://www.myntra.com/sports-shoes/puma/puma-men-grey-kuris-ii-ind-running-shoes/107455/buy?searchQuery=sports-shoes&serp=1&uq=false#!',
             'http://blahblah.com/sports-shoes/puma/puma-men-grey-kuris-ii-ind-running-shoes/107455/buy?searchQuery=sports-shoes&serp=1&uq=false#!',
             'ftp://myntra.com/sports-shoes/puma/puma-men-grey-kuris-ii-ind-running-shoes/107455/buy?searchQuery=sports-shoes&serp=1&uq=false#!']

test_expected_results = [True, True, True, False, False, False, False]


def get_vendor(url):
    def add_http(url):
        return 'http://' + url

    def get_url(parsed_url):
        return 'http://' + parsed_url.netloc + parsed_url.path

    if not urlparse(url).scheme:
        parsed_url = urlparse(add_http(url))
    elif urlparse(url).scheme not in ['http', 'https']:
        return (None, url)
    else:
        parsed_url = urlparse(url)
    if parsed_url.netloc not in supported_vendors:
        return (None, url)
    else:
        return (parsed_url.netloc, get_url(parsed_url))


if __name__ == '__main__':
    # for i in map(get_vendor, test_urls):
    #     print i
    assert(map(lambda x: bool(x[0]), map(get_vendor, test_urls)) == test_expected_results)

Based on the vendor that get_vendor returns, I apply different scraping functions. Here is the next piece of code:

HS18 = ['homeshop18.com', 'www.homeshop18.com']
SD = ['snapdeal.com', 'www.snapdeal.com']
FS = ['flipkart.com', 'www.flipkart.com']
MYN = ['myntra.com', 'www.myntra.com']

vendor, url = get_vendor(url)
if not vendor:
    exit()  # or raise an error
product_doc = search_for_url_in_database(url)
if not product_doc:
    if vendor in FS:
        product_url, product_name, product_img_url, product_price = get_flipkart_product_meta(url)
    elif vendor in HS18:
        product_url, product_name, product_img_url, product_price = get_homeshop18_product_meta(url)
    elif vendor in SD:
        product_url, product_name, product_img_url, product_price = get_snapdeal_product_meta(url)
    else:
        product_url, product_name, product_img_url, product_price = get_myntra_product_meta(url)
...

I can change this last if block to:

if not product_doc:
    if vendor in FS:
        get_product_meta = get_flipkart_product_meta
    elif vendor in HS18:
        get_product_meta = get_homeshop18_product_meta
    elif vendor in SD:
        get_product_meta = get_snapdeal_product_meta
    else:
        get_product_meta = get_myntra_product_meta
    product_url, product_name, product_img_url, product_price = get_product_meta(url)
...
  1. What do you say about this second if block, where I assign a function object to a variable and call it later? Should I be doing it like this? What is the suggested way?

  2. The lists defined at the top contain redundant data. Is there any way to make this better?

  3. I would really like to keep the leading www if the URL has it (even though dropping it does no harm). How do I do this in a cleaner way, or do you suggest removing www completely?

More generally, how do you handle situations like this, where you have to call a function based on if-else blocks?
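The cleanest alternative to the if/elif chain that I can think of is a dispatch table mapping a vendor domain to its scraper function. This is only a rough sketch reusing the get_*_product_meta functions from above; the SCRAPERS name and the scrape wrapper are made up for illustration, and the bare-domain keys still assume the www. prefix gets normalised somewhere:

# Hypothetical sketch of a dispatch table; SCRAPERS and scrape are illustrative names only.
SCRAPERS = {
    'flipkart.com': get_flipkart_product_meta,
    'homeshop18.com': get_homeshop18_product_meta,
    'snapdeal.com': get_snapdeal_product_meta,
    'myntra.com': get_myntra_product_meta,
}

def scrape(vendor, url):
    # dict.get with a default plays the role of the final else branch above
    get_product_meta = SCRAPERS.get(vendor, get_myntra_product_meta)
    return get_product_meta(url)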

EDIT: The following code was suggested by someone on reddit. It looks clean to me:

PRODUCT_METAS = {
    'homeshop18.com': get_homeshop18_product_meta,
    'snapdeal.com': get_snapdeal_product_meta,
    'flipkart.com': get_flipkart_product_meta,
    'myntra.com': get_myntra_product_meta,
}

vendor, url = get_vendor(url)
if vendor is None:
    exit()
product_doc = search_for_url_in_database(url)
if not product_doc:
    if vendor.startswith('www.'):
        vendor = vendor[len('www.'):]
    try:
        product_url, product_name, product_img_url, product_price = PRODUCT_METAS[vendor](url)
    except KeyError:
        exit()
asked Feb 1, 2014 at 13:26
  • You allow https but get_url always uses http. Is that right? – Commented Feb 2, 2014 at 9:19
  • Yes, you are right. These e-commerce sites don't mind whether the URL is HTTP or HTTPS. Otherwise, in get_url I would have to check whether the scheme is http or https and append the matching prefix, along the lines of the sketch below. Just to avoid an if block, which I thought unnecessary since these sites do not complain, I went with HTTP. Thank you for going over my code. – Commented Feb 2, 2014 at 9:39
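For reference, a minimal sketch of such a scheme-preserving get_url (same function name as in the question; the string formatting is just one way to do it):

def get_url(parsed_url):
    # Keep whichever scheme was parsed (http or https) instead of hard-coding http://
    return '%s://%s%s' % (parsed_url.scheme, parsed_url.netloc, parsed_url.path)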

1 Answer


The two helper functions feel redundant to me: 'http://' + url is clear enough as it is, and there is an urlunparse in the urlparse module.

My proposal:

from urlparse import urlparse, urlunparse

def get_vendor(url):
    parsed_url = urlparse(url)
    if not parsed_url.scheme:
        parsed_url = urlparse('http://' + url)
    scheme, netloc, path, params, query, fragment = parsed_url
    if scheme in ['http', 'https'] and netloc in supported_vendors:
        return (netloc, urlunparse((scheme, netloc, path, '', '', '')))
    else:
        return (None, url)
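For what it's worth, this version should still pass the assert from your test harness (reusing test_urls and test_expected_results from the question, and relying on Python 2's map returning a list), since the vendor is again the first element of the returned tuple:

# Re-run the question's check against the rewritten get_vendor
assert map(lambda pair: bool(pair[0]), map(get_vendor, test_urls)) == test_expected_results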

To deal with an optional www. in front, you can avoid complicating the get_vendor function by adding the www. variant of each known domain programmatically. If you later need a vendor that is only available with (or only without) www., you can simply change this back to an explicit list and add the entry there.

supported_vendors = ['flipkart.com', 'homeshop18.com', 'snapdeal.com',
                     'myntra.com']
supported_vendors += ['www.' + x for x in supported_vendors]
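If you would rather not duplicate the entries at all, another option is to keep only the bare domains and strip an optional www. from the netloc just for the membership test, while returning the netloc and URL exactly as the user gave them. This is only a sketch of how the lookup in the get_vendor above could change; the bare_netloc name is mine:

supported_vendors = ['flipkart.com', 'homeshop18.com', 'snapdeal.com', 'myntra.com']

def get_vendor(url):
    parsed_url = urlparse(url)
    if not parsed_url.scheme:
        parsed_url = urlparse('http://' + url)
    scheme, netloc, path, params, query, fragment = parsed_url
    # Normalise only for the lookup; the returned vendor and URL keep any leading www.
    bare_netloc = netloc[len('www.'):] if netloc.startswith('www.') else netloc
    if scheme in ['http', 'https'] and bare_netloc in supported_vendors:
        return (netloc, urlunparse((scheme, netloc, path, '', '', '')))
    return (None, url)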
answered Feb 2, 2014 at 10:17
  • Really much cleaner code! I did not know about urlunparse; that is why I was adding HTTP and another helper function. With your code it actually retains the URL scheme. Thank you very much! – Commented Feb 2, 2014 at 11:50
  • Can I do anything about the leading www? Because of it I have to maintain the list with redundant data (i.e. with and without www). I just checked, and these e-commerce sites do work without the leading www, but should I be removing it? Once the get_vendor function returns the vendor and URL, I have to take different actions depending on the vendor. I updated the OP; I would appreciate it if you could have another look. – Commented Feb 2, 2014 at 14:02
