I have been working on where I want to improve my knowledge to do a safe JSON scraping. By that is that I could pretty much do a ugly workaround and wrap it into a try-except but that is not the case. I would like to know where it goes wrong and how I can improve the code I have been working with:
import json
from json import JSONDecodeError
import requests
payload = {
"name": None,
"price": None,
"sizes": None
}
headers = {'origin': 'https://www.aplace.com', 'Content-Type': 'application/json'}
with requests.get("http://api.aplace.com/sv/dam/aaltoili-iso-mehu-gren-red-sand", headers=headers) as response:
if not response.ok:
print(f"Requests NOT ok -> {payload}")
try:
# They added extra comma to fail the JSON. Fixed it by removing duplicate commas
doc = json.loads(response.text.replace(",,", ",")).get("productData", {})
except (JSONDecodeError, KeyError, ValueError) as err:
print(f"Not able to parse json -> {err} -> Payload: {payload}")
raise
if doc:
product_name = doc.get('name', {})
product_brand = doc.get('brandName', {})
product_price = doc.get('markets', {}).get('6', {}).get('pricesByPricelist', {}).get('3', {}).get('price', {})
# Added exception incase the [0] is not able to get any value from 'media'
try:
product_image = doc.get('media', {})[0].get('sources', {}).get('full', {}).get('url', {})
except KeyError:
product_image = None
product_size = doc.get('items', {})
try:
if product_name and product_brand:
payload['name'] = f"{product_brand} {product_name}"
if product_price:
payload['price'] = product_price.replace('\xa0', '')
if product_image:
payload['image'] = product_image
if product_size:
count = 0
sizes = []
sizesWithStock = []
for value in product_size.values():
if value.get('stockByMarket', {}).get('6', {}):
count += value['stockByMarket']['6']
sizes.append(value['name'])
sizesWithStock.append(f"{value['name']} - ({value['stockByMarket']['6']})")
payload['sizes'] = sizes
payload['sizesWithStock'] = sizesWithStock
payload['stock'] = count
except Exception as err:
print("Create notification if we hit here later on")
raise
print(f"Finished payload -> {payload}")
My goal here is to cover all the values and if we do not find it inside the json then we use the if statements to see if we have caught a value or not.
Things that annoys me:
- Long nested dicts but im not sure if there is any other way, is there a better way?
- try-except for product_image, because its a list and I just want to grab the first in the list, is it possible to skip the try-except?
- If unexpected error happens, print to the exception but am I doing it correctly?
Hopefully I can get new knowledge from here :)
-
\$\begingroup\$ I would like to know where it goes wrong - does it complete without error? \$\endgroup\$Reinderien– Reinderien2021年04月12日 14:59:48 +00:00Commented Apr 12, 2021 at 14:59
-
1\$\begingroup\$ @Reinderien right now the script works without any errors yes, but what if etc JSON is incorrect or unexpected errors happens, thats what I meant "where it goes wrong" \$\endgroup\$PythonNewbie– PythonNewbie2021年04月12日 15:28:01 +00:00Commented Apr 12, 2021 at 15:28
2 Answers 2
You need functions.
Functions help to split up your code and make understanding how everything fits together easier.You should have a function for:
Getting the page.
Parsing the JSON with your helpful error message.
Getting the payload.
A
main
function for calling all the other functions.We should only call the main function if the script is the entry point. We can do so with the following code:
if __name__ == "__main__": main()
You should add a function to walk paths.
product_price = doc.get('markets', {}).get('6', {}).get('pricesByPricelist', {}).get('3', {}).get('price', {})
# Added exception incase the [0] is not able to get any value from 'media' try: product_image = doc.get('media', {})[0].get('sources', {}).get('full', {}).get('url', {}) except KeyError: product_image = None
Things would be so much simpler if we could just do:
price = get(data, "markets", "6", "pricesByPricelist", "3", "price") image = get(data, "media", 0, "sources", "full", "url")
To build get is simple.
- We take an object, a path and a default value.
- For each segment of the path we index the object.
- If the object doesn't have the value (
__getitem__
raisesLookupError
) we return the default value. - The default value should be to raise an error.
We can ensure anything the user passes to the function is returned. By making a new instance of an object we can then check if the defaultis
the instance. - For convenience we can build the
get
function which defaults toNone
.
_SENTINEL = object() def walk(obj, path, default=_SENTINEL): try: for segment in path: obj = obj[segment] return obj except LookupError: if default is _SENTINEL: raise LookupError(f"couldn't walk path; {path}") from None else: return default def get(obj, *path): return walk(obj, path, default=None)
I'd highly recommend using exceptions for control flow in Python.
We can subclass
BaseException
to make an exception that prints to the user and then exits the system. By wrapping themain()
call in atry
we can check if our custom error is raised print and then exit with a status of 1.try: main() except PrettyExitException as e: print(e) raise SystemExit(1) from None
import json
from json import JSONDecodeError
import requests
_SENTINEL = object()
class PrettyExitException(BaseException):
pass
def get_page():
headers = {'origin': 'https://www.aplace.com', 'Content-Type': 'application/json'}
with requests.get("http://api.aplace.com/sv/dam/aaltoili-iso-mehu-gren-red-sand", headers=headers) as response:
if not response.ok:
raise PrettyExitException(f"Requests NOT ok -> {payload}")
return response.text
def parse_json(raw_json):
try:
return json.loads(raw_json)
except (JSONDecodeError, KeyError, ValueError) as err:
raise PrettyExitException(f"Not able to parse JSON -> {err} -> Payload: {payload}") from None
def walk(obj, path, default=_SENTINEL):
try:
for segment in path:
obj = obj[segment]
return obj
except LookupError:
if default is _SENTINEL:
raise LookupError(f"couldn't walk path; {path}") from None
else:
return default
def get(obj, *path):
return walk(obj, path, default=None)
def build_payload(data):
name = get(data, "name")
brand = get(data, "brandName")
price = get(data, "markets", "6", "pricesByPricelist", "3", "price")
image = get(data, "media", 0, "sources", "full", "url")
size = get(data, "items")
count = 0
sizes = []
sizes_with_stock = []
if name is not None and brand is not None:
name = f"{brand} {name}"
else:
name = None
if price is not None:
price = price.replace("\xa0", "")
if size is not None:
for value in size.values():
stock = get(value, "stockByMarket", "6")
if stock is not None:
_name = get(value, "name")
count += stock
sizes.append(_name)
sizes_with_stock.append(f"{_name} - ({stock})")
return {
"name": name,
"price": price,
"image": image,
"sizes": sizes,
"sizesWithStock": sizes_with_stock,
"stock": count,
}
def main():
raw_data = get_page()
data = parse_json(raw_data.replace(",,", ","))
payload = build_payload(data.get("productData", {}))
print(f"Finished payload -> {payload}")
if __name__ == "__main__":
try:
main()
except PrettyExitException as e:
print(e)
raise SystemExit(1) from None
-
\$\begingroup\$ Hi! Thank you for the awesome review! I have currently tested your code and it seems like everything returns None aswell as
UnboundLocalError: local variable 'stock' referenced before assignment
when running it :'( \$\endgroup\$PythonNewbie– PythonNewbie2021年04月12日 19:07:41 +00:00Commented Apr 12, 2021 at 19:07 -
1\$\begingroup\$ @ProtractorNewbie Oh sorry I made some typos! The code should be good now. \$\endgroup\$2021年04月12日 19:09:59 +00:00Commented Apr 12, 2021 at 19:09
-
\$\begingroup\$ No problem! There is also another small typo in
name = get(value, "name")
as you already using it at the top :) But another question, wouldn't it be ok to just useif name and brand
becausename
doesnt have a value, then it wont go through the if statement without needing to add theis None
? \$\endgroup\$PythonNewbie– PythonNewbie2021年04月12日 19:11:47 +00:00Commented Apr 12, 2021 at 19:11 -
1\$\begingroup\$ @ProtractorNewbie Darn it, you're correct again! Sorry. Yes you can use
if name and brand
. I'm unsure what your data looks like so I chose to play things safe and assume any valid string is a valid name/brand. Yes if you think the API returning an empty name should be treated as no value please change the line to whatever works best for you. \$\endgroup\$2021年04月12日 19:16:05 +00:00Commented Apr 12, 2021 at 19:16 -
\$\begingroup\$ Im thinking since if we cannot get the name from your
get
function, then it will return None either way, thats what I was thinking. So either it finds it or it will return None and by that we could easily justif name...
in that case, unless im missing something in between your walk function :) But I really love the function, looks so much better for sure! \$\endgroup\$PythonNewbie– PythonNewbie2021年04月12日 19:18:30 +00:00Commented Apr 12, 2021 at 19:18
- You need to get rid of your "payload" antipattern. The dictionary you're constructing isn't particularly a payload; it's just a grab bag of untyped, unorganized data. If not OOP, then at least put this in normal variables; but avoid a dictionary unless you're actually doing something that requires a dictionary - e.g. saving to a JSON file, or sending to a secondary web service.
- Write a generic function that accepts a product name, rather than just having a pile of global code
- Consider adding PEP484 type hints
- Use
response.raise_for_status
instead of checkingok
yourself, particularly since you care about proper exception handling - Your global comma replacement is not safe. My guess is that instead of They added extra comma to fail the JSON, they just made a mistake - never attribute to malice that which you can attribute to ignorance. Attempt to replace this more narrowly.
- You seem to want to use
.get('x', {})
as a magic bullet without fully understanding what it does, because you've used it in some places where it's not at all appropriate, such as.get('price', {})
. You do not want to default a price to a dictionary. - Consider wrapping exceptions in your own type and catching that at the top level
- Where possible, either put magic constants such as
6
in a constant, or avoid it by searching for "what you actually want"; e.g. a specific currency string.
Suggested
import json
from decimal import Decimal
from json import JSONDecodeError
from pprint import pformat
from typing import Dict, Any, Optional
import requests
MARKET_ID = '6'
CURRENCY = 'SEK'
class APlaceError(Exception):
pass
def get_api(product: str) -> Dict[str, Any]:
headers = {
'Origin': 'https://www.aplace.com',
'Accept': 'application/json',
}
try:
with requests.get(f'http://api.aplace.com/sv/dam/{product}', headers=headers) as response:
response.raise_for_status()
payload = response.text.replace(',,"cached"', ',"cached"')
doc = json.loads(payload)
except IOError as err:
raise APlaceError('Unable to get response from APlace server') from err
except JSONDecodeError as err:
raise APlaceError('Unable to parse JSON') from err
if doc.get('content', {}).get('is404'):
raise APlaceError(f'Product "{product}" not found')
return doc
class Product:
def __init__(self, doc: Dict[str, Any]):
try:
data = doc['productData']
except KeyError as e:
raise APlaceError('Empty response') from e
product = data.get('product')
if product is None:
self.id: Optional[int] = None
else:
self.id = int(product)
self.uri: Optional[str] = data.get('uri')
self.name: Optional[str] = data.get('name')
self.brand_name: Optional[str] = data.get('brandName')
self.sku: Optional[str] = data.get('sku')
# etc.
try:
prices = next(
price
for price in data.get('markets', {})
.get(MARKET_ID, {})
.get('pricesByPricelist', {})
.values()
if price.get('price', '').endswith(CURRENCY)
)
price = prices.get('priceAsNumber')
if price is None:
self.price: Optional[float] = None
else:
self.price = Decimal(price)
self.price_with_currency: Optional[str] = prices.get('price')
except StopIteration:
self.price = None
self.price_with_currency = None
# Added exception incase the [0] is not able to get any value from 'media'
try:
self.image_url: Optional[str] = next(
medium.get('sources', {}).get('full', {}).get('url')
for medium in data.get('media', [])
)
except StopIteration:
self.image_url = None
self.stock_by_size = {
item.get('name'): item.get('stockByMarket').get(MARKET_ID, 0)
for item in data.get('items', {}).values()
}
def __str__(self):
return (
f'Brand: {self.brand_name}\n'
f'Name: {self.name}\n'
f'Price: {self.price_with_currency}\n'
f'Image URL: {self.image_url}\n'
f'Stock for each size: {pformat(self.stock_by_size)}\n'
)
def main():
try:
prod = Product(get_api('aaltoili-iso-mehu-gren-red-sand'))
print(prod)
except APlaceError as e:
print(f'Failed to query product: {e}')
if __name__ == '__main__':
main()
-
\$\begingroup\$ Hi again! You have been amazing and I really wished honestly that you had a donation that I could donate too! You been really awesome to me and I have been learning so much from you! Honestly! \$\endgroup\$PythonNewbie– PythonNewbie2021年04月12日 19:23:28 +00:00Commented Apr 12, 2021 at 19:23
-
\$\begingroup\$ The reason im using the payload is that I want to return as a dict, but I could also change it. The plan was to have a prefix of a dict where I could add stuff to the payload and then replace by using
payload["name"]
as I did but I could also do something like{"name": self.name or _prefix.name_
sort of. I think as you say, I should clear out the payload. But the reason also is that when we raise an error or that the response status code is not OK, I want to return atleast some information to the prefixed payload, if that make sense? \$\endgroup\$PythonNewbie– PythonNewbie2021年04月12日 19:25:58 +00:00Commented Apr 12, 2021 at 19:25 -
1\$\begingroup\$ I'm glad you appreciate the review :) As to your "at least some information", the desire is good but the execution is off. It's entirely possible to populate a class with only some members; hence the
Optional
notation that acceptsNone
if a value is absent. \$\endgroup\$Reinderien– Reinderien2021年04月12日 19:30:32 +00:00Commented Apr 12, 2021 at 19:30 -
\$\begingroup\$ Indeed! I just thought about for example, lets say that we have a page that is found but the API is not updated fully, meaning example that the value
data = doc['productData']
is not in placed in the API yet. So what I thought was that in that exception of it, to return a payload which I could for example strip and split the URL to get end of the URL which usually is the name of a product. then I could know atleast that there is a product with name as alittle information to know...
- Does that make some sense ? :D \$\endgroup\$PythonNewbie– PythonNewbie2021年04月12日 19:35:44 +00:00Commented Apr 12, 2021 at 19:35 -
1\$\begingroup\$ Kind of? If all you have is the product URI and everything else fails, then you can just print the product URI. No need for stripping and splitting. \$\endgroup\$Reinderien– Reinderien2021年04月12日 19:48:14 +00:00Commented Apr 12, 2021 at 19:48