Finding e-commerce site of the product URL given and see if it is supported

Question 1

I am building a Python app in which user can maintain wishlist of the products. I only support few e-commerce sites and do not support country specific sites (e.g I may support amazon.com but not amazon.in). I do not support mobile version of the URL (e.g. http://m.amazon.com) and also I am not interested in the query string part of the URL (and I don't want it also).

Following is the code and also the test cases. Though it seems to be working, I am not happy with the code. It looks hackish to me. I would really appreciate improving get_vendor function. Do you find it readable?

from urlparse import urlparse
supported_vendors = ['flipkart.com', 'homeshop18.com', 'snapdeal.com', 
 'myntra.com', 'www.flipkart.com', 'www.homeshop18.com', 
 'www.snapdeal.com', 'www.myntra.com']
test_urls =['http://www.myntra.com/sports-shoes/puma/puma-men-grey-kuris-ii-ind-running-shoes/107455/buy?searchQuery=sports-shoes&serp=1&uq=false#!',
 'www.myntra.com/sports-shoes/puma/puma-men-grey-kuris-ii-ind-running-shoes/107455/buy?searchQuery=sports-shoes&serp=1&uq=false#!',
 'https://myntra.com/sports-shoes/puma/puma-men-grey-kuris-ii-ind-running-shoes/107455/buy?searchQuery=sports-shoes&serp=1&uq=false#!',
 'http://.myntra.com/sports-shoes/puma/puma-men-grey-kuris-ii-ind-running-shoes/107455/buy?searchQuery=sports-shoes&serp=1&uq=false#!',
 'htt://www.myntra.com/sports-shoes/puma/puma-men-grey-kuris-ii-ind-running-shoes/107455/buy?searchQuery=sports-shoes&serp=1&uq=false#!',
 'http://blahblah.com/sports-shoes/puma/puma-men-grey-kuris-ii-ind-running-shoes/107455/buy?searchQuery=sports-shoes&serp=1&uq=false#!',
 'ftp://myntra.com/sports-shoes/puma/puma-men-grey-kuris-ii-ind-running-shoes/107455/buy?searchQuery=sports-shoes&serp=1&uq=false#!']
test_expected_results = [True, True, True, False, False, False, False]
def get_vendor(url):
 def add_http(url):
 return 'http://'+url
 def get_url(parsed_url):
 return 'http://'+parsed_url.netloc+parsed_url.path
 if not urlparse(url).scheme:
 parsed_url = urlparse(add_http(url))
 elif urlparse(url).scheme not in ['http', 'https']:
 return (None, url)
 else:
 parsed_url = urlparse(url)
 if parsed_url.netloc not in supported_vendors:
 return (None, url)
 else:
 return (parsed_url.netloc, get_url(parsed_url))
if __name__ == '__main__':
 # for i in map(get_vendor, test_urls):
 # print i
 assert(map(lambda x: bool(x[0]), map(get_vendor, test_urls)) == test_expected_results)

based on what vendor returns I apply different scraping functions. Here is the next code:

HS18 = ['homeshop18.com', 'www.homeshop18.com']
SD = ['snapdeal.com', 'www.snapdeal.com']
FS = ['flipkart.com', 'www.flipkart.com']
MYN = ['myntra.com', 'www.myntra.com']
vendor, url = get_vendor(url)
if not vendor:
 exit # or throw error
product_doc = search_for_url_in_database(url)
if not product_doc:
 if vendor in FS:
 product_url, product_name, product_img_url, product_price = get_flipkart_product_meta(url)
 elif vendor in HS18:
 product_url, product_name, product_img_url, product_price = get_homeshop18_product_meta(url)
 elif vendor in SD:
 product_url, product_name, product_img_url, product_price = get_snapdeal_product_meta(url)
 else:
 product_url, product_name, product_img_url, product_price = get_myntra_product_meta(url)
...

This last if block I can change it to:

if not product_doc:
 if vendor in FS:
 get_product_meta = get_flipkart_product_meta
 elif vendor in HS18:
 get_product_meta = get_homeshop18_product_meta
 elif vendor in SD:
 get_product_meta = get_snapdeal_product_meta
 else:
 get_product_meta = get_myntra_product_meta
 product_url, product_name, product_img_url, product_price = get_product_meta(url)
...

What do you say about this second if block where I am assigning function object to a variable and calling it later. Should I be doing like this? What's the suggested way?
The lists defined at top, has redundant data. Is there any way to make it better?
I really would like to keep leading www if the url has (even though avoiding this do not harm). So how do I do this with cleaner way? or you suggest to remove www completely?

Generally how do you guys handle situation like this, when you have to call a function based on if-else blocks?

EDIT: Following code was suggested by someone on reddit. Looks clean to me:

PRODUCT_METAS = {
 'homeshop18.com': get_homeshop18_product_meta,
 'snapdeal.com': get_snapdeal_product_meta,
 'flipkart.com': get_flipkart_product_meta,
 ' myntra.com': get_myntra_product_meta,
}
vendor, url = get_vendor(url)
if vendor is None:
 exit()
product_doc = search_for_url_in_database(url)
if not product_doc:
 if vendor.startswith('www.'):
 vendor = vendor[len('www.'):]
 try:
 product_url, product_name, product_img_url, product_price = PRODUCT_METAS[vendor](url)
 except KeyError:
 exit

Question 2

You allow https but get_url always uses http. Is that right?

Question 3

yes. you are right. These e-commerce sites do not have problem if the URL is HTTP or HTTPS. Or else in get_url I have to check whether scheme is http or https, based on that I have to append HTTP or HTTPS. Just to avoid an if block, which I thought unnecessary since these sites do not complain, I thought of going with HTTP. And thank you for going over my code.

Question 4

The two helper functions feel redundant to me; 'http://'+url is clear enough as is, and there is an urlunparse in urlparse module.

My proposal:

def get_vendor(url):
 parsed_url = urlparse(url)
 if not parsed_url.scheme:
 parsed_url = urlparse('http://'+url)
 scheme, netloc, path, params, query, fragment = parsed_url
 if scheme in ['http', 'https'] and netloc in supported_vendors:
 return (netloc, urlunparse((scheme, netloc, path, '','','')))
 else:
 return (None, url)

To deal with an optional www. in front, you could avoid complicating the get_vendor function by adding it programmatically to each known URL. Now, if you needed to add a vendor that's only available with or without www., you would only have to change this back how it was and add to that list.

supported_vendors = ['flipkart.com', 'homeshop18.com', 'snapdeal.com', 
 'myntra.com']
supported_vendors += ['www.' + x for x in supported_vendors]

Question 5

A really much cleaner code! I did not know about urlunparse that's why I was adding HTTP and another helper function. Now with your code, it actually retains the url scheme. Thank you very much!

Question 6

Can I do anything about leading www ? because of this I have to maintain the list with redundant data (i.e. with and without www). I just checked and these e-commerce sites do support without leading www. But question should I be removing it? Because once the get_vendor function returns vendor and url, based on that I have to do different actions depending on the vendor. I updated the OP, would appreciate if you can have a look again.

Janne Karila Janne Karila 10.6k21 silver badges34 bronze badges · Answer 1 · 2014-02-02 10:17:27Z

The two helper functions feel redundant to me; 'http://'+url is clear enough as is, and there is an urlunparse in urlparse module.

My proposal:

def get_vendor(url):
 parsed_url = urlparse(url)
 if not parsed_url.scheme:
 parsed_url = urlparse('http://'+url)
 scheme, netloc, path, params, query, fragment = parsed_url
 if scheme in ['http', 'https'] and netloc in supported_vendors:
 return (netloc, urlunparse((scheme, netloc, path, '','','')))
 else:
 return (None, url)

To deal with an optional www. in front, you could avoid complicating the get_vendor function by adding it programmatically to each known URL. Now, if you needed to add a vendor that's only available with or without www., you would only have to change this back how it was and add to that list.

supported_vendors = ['flipkart.com', 'homeshop18.com', 'snapdeal.com', 
 'myntra.com']
supported_vendors += ['www.' + x for x in supported_vendors]

A really much cleaner code! I did not know about urlunparse that's why I was adding HTTP and another helper function. Now with your code, it actually retains the url scheme. Thank you very much!
Can I do anything about leading www ? because of this I have to maintain the list with redundant data (i.e. with and without www). I just checked and these e-commerce sites do support without leading www. But question should I be removing it? Because once the get_vendor function returns vendor and url, based on that I have to do different actions depending on the vendor. I updated the OP, would appreciate if you can have a look again.

Stack Exchange Network

Finding e-commerce site of the product URL given and see if it is supported

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Finding e-commerce site of the product URL given and see if it is supported

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions