\$\begingroup\$

The Microsoft OCR API returns JSON, and I want to extract the text data from a response like this:

response = \
{
    "language": "en",
    "textAngle": -2.0000000000000338,
    "orientation": "Up",
    "regions": [
        {
            "boundingBox": "462,379,497,258",
            "lines": [
                {
                    "boundingBox": "462,379,497,74",
                    "words": [
                        {
                            "boundingBox": "462,379,41,73",
                            "text": "A"
                        },
                        {
                            "boundingBox": "523,379,153,73",
                            "text": "GOAL"
                        },
                        {
                            "boundingBox": "694,379,265,74",
                            "text": "WITHOUT"
                        }
                    ]
                },
                {
                    "boundingBox": "565,471,289,74",
                    "words": [
                        {
                            "boundingBox": "565,471,41,73",
                            "text": "A"
                        },
                        {
                            "boundingBox": "626,471,150,73",
                            "text": "PLAN"
                        },
                        {
                            "boundingBox": "801,472,53,73",
                            "text": "IS"
                        }
                    ]
                },
                {
                    "boundingBox": "519,563,375,74",
                    "words": [
                        {
                            "boundingBox": "519,563,149,74",
                            "text": "JUST"
                        },
                        {
                            "boundingBox": "683,564,41,72",
                            "text": "A"
                        },
                        {
                            "boundingBox": "741,564,153,73",
                            "text": "WISH"
                        }
                    ]
                }
            ]
        }
    ]
}

def check_for_word(ocr):
    # Initialise our subject to None
    print("OCR: {}".format(ocr))
    subject = None
    for region in ocr["regions"]:
        if "lines" in region:
            for lines in region["lines"]:
                if "words" in lines:
                    for word in lines["words"]:
                        if "text" in word:
                            subject = word["text"].lower()
                            break
    print("OCR word is {}".format(subject))
    return subject

print(response["regions"][0]["lines"][0]["words"][0]["text"])  # Should return this
print(check_for_word(response))

  • Each dictionary contains arrays, and we cannot be sure that an array has any elements
  • We also cannot be sure that a dictionary has the expected key
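
To make these two concerns concrete, here is a minimal sketch of what direct indexing does when something is missing. The truncated dictionaries `no_lines` and `no_regions` and the helper `first_text_unsafe` are invented for illustration and are not real API output.

```python
# Hypothetical, truncated responses (assumptions, not real API output).
no_lines = {"regions": [{"boundingBox": "462,379,497,258"}]}
no_regions = {"regions": []}


def first_text_unsafe(ocr):
    # Direct indexing assumes every key and every first element exists.
    return ocr["regions"][0]["lines"][0]["words"][0]["text"]


# Record which exception each incomplete response triggers.
errors = {}
for name, ocr in [("no_lines", no_lines), ("no_regions", no_regions)]:
    try:
        first_text_unsafe(ocr)
    except (KeyError, IndexError) as exc:
        errors[name] = type(exc).__name__

print(errors)  # {'no_lines': 'KeyError', 'no_regions': 'IndexError'}
```

A missing key raises KeyError and an empty list raises IndexError, which is why every level needs some form of guard.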

Let's say we just want to return the first text found in the image.

This code works, but its deeply nested structure is a code smell. Is there a cleaner way to write it?

Toby Speight
asked Nov 15, 2018 at 16:19
\$\endgroup\$

1 Answer

\$\begingroup\$

One way to almost halve the number of lines (and levels of indentation) needed is to use dict.get with [] as the default value:

def check_for_word(ocr):
    for region in ocr["regions"]:
        for lines in region.get("lines", []):
            for word in lines.get("words", []):
                if "text" in word:
                    return word["text"].lower()
    else:
        raise KeyError("OCR word not found")
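
As a small illustration of why the [] default replaces the membership checks, the sketch below iterates over a region that lacks "lines" entirely. The sample regions and the helper name `words_in` are made up for this example.

```python
# Hypothetical, trimmed-down regions used only for illustration.
region_without_lines = {"boundingBox": "0,0,10,10"}
region_with_lines = {"lines": [{"words": [{"text": "GOAL"}, {"boundingBox": "1,1,1,1"}]}]}


def words_in(region):
    """Collect word texts, treating a missing key as an empty list."""
    found = []
    # get() returns [] when the key is absent, so the loop body never runs.
    for line in region.get("lines", []):
        for word in line.get("words", []):
            if "text" in word:
                found.append(word["text"])
    return found


print(words_in(region_without_lines))  # []
print(words_in(region_with_lines))     # ['GOAL']
```

Looping over an empty list is a no-op, so the absent key needs no separate `if` branch.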

I would also move the printing outside the function, so that you can return immediately, and add an else clause to the outer loop to catch the case where no text is present (this part could also be done outside the function with your code, by checking for None).
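
A minimal sketch of what the calling side might then look like, assuming the dict.get version of check_for_word described in this answer. The wrapper `describe` and the two sample dictionaries are hypothetical names introduced only for this example.

```python
def check_for_word(ocr):
    """Return the first recognised word in lower case, or raise KeyError."""
    for region in ocr["regions"]:
        for lines in region.get("lines", []):
            for word in lines.get("words", []):
                if "text" in word:
                    return word["text"].lower()
    else:
        # Runs when the outer loop finishes without returning.
        raise KeyError("OCR word not found")


# Hypothetical inputs: one with text, one without any lines.
sample = {"regions": [{"lines": [{"words": [{"text": "A"}, {"text": "GOAL"}]}]}]}
empty = {"regions": [{"boundingBox": "0,0,1,1"}]}


def describe(ocr):
    """Keep the printing concern outside check_for_word."""
    try:
        return "OCR word is {}".format(check_for_word(ocr))
    except KeyError as exc:
        return "No text: {}".format(exc.args[0])


print(describe(sample))  # OCR word is a
print(describe(empty))   # No text: OCR word not found
```

Returning from inside the loops also sidesteps a subtlety of break, which leaves only the innermost loop, so the outer loops would otherwise keep running.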

answered Nov 15, 2018 at 17:15
\$\endgroup\$
