1
\$\begingroup\$

I have been working on where I want to improve my knowledge to do a safe JSON scraping. By that is that I could pretty much do a ugly workaround and wrap it into a try-except but that is not the case. I would like to know where it goes wrong and how I can improve the code I have been working with:

import json
from json import JSONDecodeError
import requests
payload = {
 "name": None,
 "price": None,
 "sizes": None
}
headers = {'origin': 'https://www.aplace.com', 'Content-Type': 'application/json'}
with requests.get("http://api.aplace.com/sv/dam/aaltoili-iso-mehu-gren-red-sand", headers=headers) as response:
 if not response.ok:
 print(f"Requests NOT ok -> {payload}")
 try:
 # They added extra comma to fail the JSON. Fixed it by removing duplicate commas
 doc = json.loads(response.text.replace(",,", ",")).get("productData", {})
 except (JSONDecodeError, KeyError, ValueError) as err:
 print(f"Not able to parse json -> {err} -> Payload: {payload}")
 raise
if doc:
 product_name = doc.get('name', {})
 product_brand = doc.get('brandName', {})
 product_price = doc.get('markets', {}).get('6', {}).get('pricesByPricelist', {}).get('3', {}).get('price', {})
 # Added exception incase the [0] is not able to get any value from 'media'
 try:
 product_image = doc.get('media', {})[0].get('sources', {}).get('full', {}).get('url', {})
 except KeyError:
 product_image = None
 product_size = doc.get('items', {})
 try:
 if product_name and product_brand:
 payload['name'] = f"{product_brand} {product_name}"
 if product_price:
 payload['price'] = product_price.replace('\xa0', '')
 if product_image:
 payload['image'] = product_image
 if product_size:
 count = 0
 sizes = []
 sizesWithStock = []
 for value in product_size.values():
 if value.get('stockByMarket', {}).get('6', {}):
 count += value['stockByMarket']['6']
 sizes.append(value['name'])
 sizesWithStock.append(f"{value['name']} - ({value['stockByMarket']['6']})")
 payload['sizes'] = sizes
 payload['sizesWithStock'] = sizesWithStock
 payload['stock'] = count
 except Exception as err:
 print("Create notification if we hit here later on")
 raise
print(f"Finished payload -> {payload}")

My goal here is to cover all the values and if we do not find it inside the json then we use the if statements to see if we have caught a value or not.

Things that annoys me:

  1. Long nested dicts but im not sure if there is any other way, is there a better way?
  2. try-except for product_image, because its a list and I just want to grab the first in the list, is it possible to skip the try-except?
  3. If unexpected error happens, print to the exception but am I doing it correctly?

Hopefully I can get new knowledge from here :)

asked Apr 12, 2021 at 8:09
\$\endgroup\$
2
  • \$\begingroup\$ I would like to know where it goes wrong - does it complete without error? \$\endgroup\$ Commented Apr 12, 2021 at 14:59
  • 1
    \$\begingroup\$ @Reinderien right now the script works without any errors yes, but what if etc JSON is incorrect or unexpected errors happens, thats what I meant "where it goes wrong" \$\endgroup\$ Commented Apr 12, 2021 at 15:28

2 Answers 2

3
\$\begingroup\$
  1. You need functions.
    Functions help to split up your code and make understanding how everything fits together easier.

    You should have a function for:

    1. Getting the page.

    2. Parsing the JSON with your helpful error message.

    3. Getting the payload.

    4. A main function for calling all the other functions.

      We should only call the main function if the script is the entry point. We can do so with the following code:

      if __name__ == "__main__":
       main()
      
  2. You should add a function to walk paths.

    product_price = doc.get('markets', {}).get('6', {}).get('pricesByPricelist', {}).get('3', {}).get('price', {})
    
    # Added exception incase the [0] is not able to get any value from 'media'
    try:
     product_image = doc.get('media', {})[0].get('sources', {}).get('full', {}).get('url', {})
    except KeyError:
     product_image = None
    

    Things would be so much simpler if we could just do:

    price = get(data, "markets", "6", "pricesByPricelist", "3", "price")
    image = get(data, "media", 0, "sources", "full", "url")
    

    To build get is simple.

    1. We take an object, a path and a default value.
    2. For each segment of the path we index the object.
    3. If the object doesn't have the value (__getitem__ raises LookupError) we return the default value.
    4. The default value should be to raise an error.
      We can ensure anything the user passes to the function is returned. By making a new instance of an object we can then check if the default is the instance.
    5. For convenience we can build the get function which defaults to None.
    _SENTINEL = object()
    def walk(obj, path, default=_SENTINEL):
     try:
     for segment in path:
     obj = obj[segment]
     return obj
     except LookupError:
     if default is _SENTINEL:
     raise LookupError(f"couldn't walk path; {path}") from None
     else:
     return default
    def get(obj, *path):
     return walk(obj, path, default=None)
    
  3. I'd highly recommend using exceptions for control flow in Python.

    We can subclass BaseException to make an exception that prints to the user and then exits the system. By wrapping the main() call in a try we can check if our custom error is raised print and then exit with a status of 1.

    try:
     main()
    except PrettyExitException as e:
     print(e)
     raise SystemExit(1) from None
    
import json
from json import JSONDecodeError
import requests
_SENTINEL = object()
class PrettyExitException(BaseException):
 pass
def get_page():
 headers = {'origin': 'https://www.aplace.com', 'Content-Type': 'application/json'}
 with requests.get("http://api.aplace.com/sv/dam/aaltoili-iso-mehu-gren-red-sand", headers=headers) as response:
 if not response.ok:
 raise PrettyExitException(f"Requests NOT ok -> {payload}")
 return response.text
def parse_json(raw_json):
 try:
 return json.loads(raw_json)
 except (JSONDecodeError, KeyError, ValueError) as err:
 raise PrettyExitException(f"Not able to parse JSON -> {err} -> Payload: {payload}") from None
def walk(obj, path, default=_SENTINEL):
 try:
 for segment in path:
 obj = obj[segment]
 return obj
 except LookupError:
 if default is _SENTINEL:
 raise LookupError(f"couldn't walk path; {path}") from None
 else:
 return default
def get(obj, *path):
 return walk(obj, path, default=None)
def build_payload(data):
 name = get(data, "name")
 brand = get(data, "brandName")
 price = get(data, "markets", "6", "pricesByPricelist", "3", "price")
 image = get(data, "media", 0, "sources", "full", "url")
 size = get(data, "items")
 count = 0
 sizes = []
 sizes_with_stock = []
 if name is not None and brand is not None:
 name = f"{brand} {name}"
 else:
 name = None
 if price is not None:
 price = price.replace("\xa0", "")
 if size is not None:
 for value in size.values():
 stock = get(value, "stockByMarket", "6")
 if stock is not None:
 _name = get(value, "name")
 count += stock
 sizes.append(_name)
 sizes_with_stock.append(f"{_name} - ({stock})")
 return {
 "name": name,
 "price": price,
 "image": image,
 "sizes": sizes,
 "sizesWithStock": sizes_with_stock,
 "stock": count,
 }
def main():
 raw_data = get_page()
 data = parse_json(raw_data.replace(",,", ","))
 payload = build_payload(data.get("productData", {}))
 print(f"Finished payload -> {payload}")
if __name__ == "__main__":
 try:
 main()
 except PrettyExitException as e:
 print(e)
 raise SystemExit(1) from None
answered Apr 12, 2021 at 16:46
\$\endgroup\$
7
  • \$\begingroup\$ Hi! Thank you for the awesome review! I have currently tested your code and it seems like everything returns None aswell as UnboundLocalError: local variable 'stock' referenced before assignment when running it :'( \$\endgroup\$ Commented Apr 12, 2021 at 19:07
  • 1
    \$\begingroup\$ @ProtractorNewbie Oh sorry I made some typos! The code should be good now. \$\endgroup\$ Commented Apr 12, 2021 at 19:09
  • \$\begingroup\$ No problem! There is also another small typo in name = get(value, "name") as you already using it at the top :) But another question, wouldn't it be ok to just use if name and brand because name doesnt have a value, then it wont go through the if statement without needing to add the is None? \$\endgroup\$ Commented Apr 12, 2021 at 19:11
  • 1
    \$\begingroup\$ @ProtractorNewbie Darn it, you're correct again! Sorry. Yes you can use if name and brand. I'm unsure what your data looks like so I chose to play things safe and assume any valid string is a valid name/brand. Yes if you think the API returning an empty name should be treated as no value please change the line to whatever works best for you. \$\endgroup\$ Commented Apr 12, 2021 at 19:16
  • \$\begingroup\$ Im thinking since if we cannot get the name from your get function, then it will return None either way, thats what I was thinking. So either it finds it or it will return None and by that we could easily just if name... in that case, unless im missing something in between your walk function :) But I really love the function, looks so much better for sure! \$\endgroup\$ Commented Apr 12, 2021 at 19:18
2
\$\begingroup\$
  • You need to get rid of your "payload" antipattern. The dictionary you're constructing isn't particularly a payload; it's just a grab bag of untyped, unorganized data. If not OOP, then at least put this in normal variables; but avoid a dictionary unless you're actually doing something that requires a dictionary - e.g. saving to a JSON file, or sending to a secondary web service.
  • Write a generic function that accepts a product name, rather than just having a pile of global code
  • Consider adding PEP484 type hints
  • Use response.raise_for_status instead of checking ok yourself, particularly since you care about proper exception handling
  • Your global comma replacement is not safe. My guess is that instead of They added extra comma to fail the JSON, they just made a mistake - never attribute to malice that which you can attribute to ignorance. Attempt to replace this more narrowly.
  • You seem to want to use .get('x', {}) as a magic bullet without fully understanding what it does, because you've used it in some places where it's not at all appropriate, such as .get('price', {}). You do not want to default a price to a dictionary.
  • Consider wrapping exceptions in your own type and catching that at the top level
  • Where possible, either put magic constants such as 6 in a constant, or avoid it by searching for "what you actually want"; e.g. a specific currency string.

Suggested

import json
from decimal import Decimal
from json import JSONDecodeError
from pprint import pformat
from typing import Dict, Any, Optional
import requests
MARKET_ID = '6'
CURRENCY = 'SEK'
class APlaceError(Exception):
 pass
def get_api(product: str) -> Dict[str, Any]:
 headers = {
 'Origin': 'https://www.aplace.com',
 'Accept': 'application/json',
 }
 try:
 with requests.get(f'http://api.aplace.com/sv/dam/{product}', headers=headers) as response:
 response.raise_for_status()
 payload = response.text.replace(',,"cached"', ',"cached"')
 doc = json.loads(payload)
 except IOError as err:
 raise APlaceError('Unable to get response from APlace server') from err
 except JSONDecodeError as err:
 raise APlaceError('Unable to parse JSON') from err
 if doc.get('content', {}).get('is404'):
 raise APlaceError(f'Product "{product}" not found')
 return doc
class Product:
 def __init__(self, doc: Dict[str, Any]):
 try:
 data = doc['productData']
 except KeyError as e:
 raise APlaceError('Empty response') from e
 product = data.get('product')
 if product is None:
 self.id: Optional[int] = None
 else:
 self.id = int(product)
 self.uri: Optional[str] = data.get('uri')
 self.name: Optional[str] = data.get('name')
 self.brand_name: Optional[str] = data.get('brandName')
 self.sku: Optional[str] = data.get('sku')
 # etc.
 try:
 prices = next(
 price
 for price in data.get('markets', {})
 .get(MARKET_ID, {})
 .get('pricesByPricelist', {})
 .values()
 if price.get('price', '').endswith(CURRENCY)
 )
 price = prices.get('priceAsNumber')
 if price is None:
 self.price: Optional[float] = None
 else:
 self.price = Decimal(price)
 self.price_with_currency: Optional[str] = prices.get('price')
 except StopIteration:
 self.price = None
 self.price_with_currency = None
 # Added exception incase the [0] is not able to get any value from 'media'
 try:
 self.image_url: Optional[str] = next(
 medium.get('sources', {}).get('full', {}).get('url')
 for medium in data.get('media', [])
 )
 except StopIteration:
 self.image_url = None
 self.stock_by_size = {
 item.get('name'): item.get('stockByMarket').get(MARKET_ID, 0)
 for item in data.get('items', {}).values()
 }
 def __str__(self):
 return (
 f'Brand: {self.brand_name}\n'
 f'Name: {self.name}\n'
 f'Price: {self.price_with_currency}\n'
 f'Image URL: {self.image_url}\n'
 f'Stock for each size: {pformat(self.stock_by_size)}\n'
 )
def main():
 try:
 prod = Product(get_api('aaltoili-iso-mehu-gren-red-sand'))
 print(prod)
 except APlaceError as e:
 print(f'Failed to query product: {e}')
if __name__ == '__main__':
 main()
answered Apr 12, 2021 at 18:16
\$\endgroup\$
6
  • \$\begingroup\$ Hi again! You have been amazing and I really wished honestly that you had a donation that I could donate too! You been really awesome to me and I have been learning so much from you! Honestly! \$\endgroup\$ Commented Apr 12, 2021 at 19:23
  • \$\begingroup\$ The reason im using the payload is that I want to return as a dict, but I could also change it. The plan was to have a prefix of a dict where I could add stuff to the payload and then replace by using payload["name"] as I did but I could also do something like {"name": self.name or _prefix.name_ sort of. I think as you say, I should clear out the payload. But the reason also is that when we raise an error or that the response status code is not OK, I want to return atleast some information to the prefixed payload, if that make sense? \$\endgroup\$ Commented Apr 12, 2021 at 19:25
  • 1
    \$\begingroup\$ I'm glad you appreciate the review :) As to your "at least some information", the desire is good but the execution is off. It's entirely possible to populate a class with only some members; hence the Optional notation that accepts None if a value is absent. \$\endgroup\$ Commented Apr 12, 2021 at 19:30
  • \$\begingroup\$ Indeed! I just thought about for example, lets say that we have a page that is found but the API is not updated fully, meaning example that the value data = doc['productData'] is not in placed in the API yet. So what I thought was that in that exception of it, to return a payload which I could for example strip and split the URL to get end of the URL which usually is the name of a product. then I could know atleast that there is a product with name as a little information to know... - Does that make some sense ? :D \$\endgroup\$ Commented Apr 12, 2021 at 19:35
  • 1
    \$\begingroup\$ Kind of? If all you have is the product URI and everything else fails, then you can just print the product URI. No need for stripping and splitting. \$\endgroup\$ Commented Apr 12, 2021 at 19:48

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.