
Background

Chronicling America is an archive of digitized US newspapers hosted by the Library of Congress. The site allows programmatic access to its resources through an API, one part of which is the JSON representation of each newspaper. Newspapers are composed of issues (denoted by date and edition number), and issues are composed of pages, which have several representations. I am interested in the ocr.txt representation.

Each newspaper is represented in the API by URL-linked JSON representations:

newspaper.json -> issue.json -> page.json -> ocr.txt

Example URL sequence to arrive at a single txt page of a newspaper issue:

http://chroniclingamerica.loc.gov/lccn/sn84026994.json
http://chroniclingamerica.loc.gov/lccn/sn84026994/1865-08-21/ed-1.json
http://chroniclingamerica.loc.gov/lccn/sn84026994/1865-08-21/ed-1/seq-1.json
http://chroniclingamerica.loc.gov/lccn/sn84026994/1865-08-21/ed-1/seq-1/ocr.txt

where

  • sn84026994 = newspaper lccn
  • 1865-08-21/ed-1 = issue
  • seq-1 = page
  • ocr.txt = plain text representation of page
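
For illustration, this is roughly what the traversal looks like in code - a minimal sketch with urllib (the 'issues', 'pages', 'url', and 'text' keys are the ones my module below relies on; the [0] indexes simply pick the first issue and its first page):

import json
from urllib.request import urlopen

base = 'http://chroniclingamerica.loc.gov/lccn/sn84026994'
newspaper = json.loads(urlopen(base + '.json').read().decode('utf-8'))
issue = json.loads(urlopen(newspaper['issues'][0]['url']).read().decode('utf-8'))
page = json.loads(urlopen(issue['pages'][0]['url']).read().decode('utf-8'))
ocr_text = urlopen(page['text']).read().decode('utf-8')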

Design

This Python module is designed to traverse the URL links for a given newspaper and a range of issues (selected by date of issue).

The two functions designed to be imported and to interact with calling code are disp_newspaper() and dwnld_newspaper():

  1. disp_newspaper() retrieves some data about the newspaper and prints it to the terminal to assist the user in deciding on arguments for dwnld_newspaper()
  2. dwnld_newspaper() downloads and assembles the ocr.txt files for each desired issue of a given newspaper into a dict {'date': 'text'} (see the usage sketch below)
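
Intended usage looks like this (the date range here matches the example issue above):

from datetime import date

url = 'http://chroniclingamerica.loc.gov/lccn/sn84026994.json'
disp_newspaper(url)  # inspect the available issues first
issues = dwnld_newspaper(url, start_date=date(1865, 8, 21),
                         end_date=date(1865, 9, 1))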

Questions

As I am somewhat of a beginner at this kind of coding:

  1. Is this code easy to understand? If not, how can it be refactored?
  2. Does it follow style guides?
  3. Have I missed a simple strategy for traversing the API endpoints and obtaining the end result of a single string of all .txt files for a given newspaper and issue?

Callable Code

import json
import os
from datetime import datetime
from urllib.request import Request, urlopen
from urllib.error import URLError


def disp_newspaper(url):
    """Displays information and issues available for a given newspaper
    Parameters: url -> url of JSON file for newspaper: str
    Returns: newspaper_json -> dict representation of JSON from http
                               request: dict"""
    try:
        newspaper_json = get_json(url)
    except ValueError as e:
        return e
    newspaper_string = ('{} | Library of Congress No.: {} | {}\nPublished '
                        'from {} to {} by {}').format(
                            newspaper_json['name'],
                            newspaper_json['lccn'],
                            newspaper_json['place_of_publication'],
                            newspaper_json['start_year'],
                            newspaper_json['end_year'],
                            newspaper_json['publisher'])
    issues_string = ('Number of Issues Downloadable: {}\nFirst issue: {}\n'
                     'Last Issue: {}\n').format(
                         len(newspaper_json['issues']),
                         newspaper_json['issues'][0]['date_issued'],
                         newspaper_json['issues'][-1]['date_issued'])
    print(newspaper_string)
    print('\n', end='')
    print(issues_string)


def dwnld_newspaper(url, start_date, end_date):
    """Downloads OCR text of a newspaper from chroniclingamerica.loc.gov by
    parsing the .json representation using the exposed API. Traverses
    the json from the newspaper .json url to each page and composes them into
    a dict of issues where {'date': 'issue text'}
    Params: url -> str: base url of newspaper. Ends in .json
            start_date -> date: date(year, month, day)
                          represents the first issue to download
            end_date -> date: date(year, month, day)
                        represents the last issue to download
    Return: newspaper_issues -> dict: {'date': 'issue text'}"""
    newspaper_issues = {}
    # Terminal UI Print statements
    print('start date:', start_date)
    print('end date:', end_date)
    # Interface
    print('Getting issues:')
    # TODO: handle more than 2 issue editions for same date
    try:
        for issue in get_json(url)['issues']:
            if (parse_date(issue['date_issued']) >= start_date and
                    parse_date(issue['date_issued']) <= end_date):
                # Check for multiple issues with same date
                if issue['date_issued'] not in newspaper_issues:
                    print(issue['date_issued'])
                    newspaper_issues[issue['date_issued']] = \
                        assemble_issue(issue['url'])
                # append to differentiate second edition of same date
                else:
                    print(issue['date_issued'] + '-ed-2')
                    newspaper_issues[issue['date_issued'] + '-ed-2'] = \
                        assemble_issue(issue['url'])
        return newspaper_issues  # dict {'date_issued': 'alltextforallpages'}
    except ValueError as e:
        return e

Supporting Functions

The above two functions rely on several helper functions that modularize the code.

def validate_chronam_url(url):
    """Naive check. Ensures that the url goes to a
    chroniclingamerica.loc.gov newspaper
    and references the .json representation
    Params: url -> url of JSON file for newspaper to download: str
    Return: Boolean"""
    domain_chk = 'chroniclingamerica.loc.gov/lccn/sn'
    json_chk = '.json'
    if domain_chk in url and json_chk in url:
        return True
    else:
        return False


def get_json(url):
    """Downloads json from url from chroniclingamerica.loc.gov
    and saves as a Python dict.
    Parameters: url -> url of JSON file for newspaper
                to download: str
    Returns: json_dict -> dict representation of
             JSON from http request: dict"""
    r = Request(url)
    # Catch non-chronam urls
    if validate_chronam_url(url) is not True:
        raise ValueError('Invalid url for chroniclingamerica.loc.gov '
                         'OCR newspaper (url must end in .json)')
    try:
        data = urlopen(r)
    except URLError as e:
        if hasattr(e, 'reason'):
            print('We failed to reach a server.')
            print('Reason: ', e.reason)
            print('url: ', url)
        elif hasattr(e, 'code'):
            print('The server couldn\'t fulfill the request.')
            print('Error code: ', e.code)
            print('url: ', url)
    else:
        # read().decode('utf-8') is necessary for Python 3.4
        json_dict = json.loads(data.read(), encoding='utf-8')
        return json_dict


def get_txt(url):
    """Downloads txt from url from chroniclingamerica.loc.gov
    and saves as python str.
    Relies on valid url supplied by get_json()
    Parameters: url -> url for OCR text returned by get_json(): str
    Returns: retrieved_txt -> OCR text: str"""
    # TODO: return lists of missing & failed pages
    missing_pages = []
    failed_pages = []
    r = Request(url)
    try:
        data = urlopen(r)
    except URLError as e:
        if hasattr(e, 'reason'):
            print('We failed to reach a server.')
            print('Reason: ', e.reason)
            print('url: ', url)
            retrieved_txt = ('Likely Missing Page: Not digitized, '
                             'published')
            missing_pages.append(url)
        elif hasattr(e, 'code'):
            print('The server couldn\'t fulfill the request.')
            print('Error code: ', e.code)
            print('url: ', url)
            retrieved_txt = 'Server didn\'t return any text'
            failed_pages.append(url)
    else:
        retrieved_txt = data.read().decode('utf-8')
    return retrieved_txt


def dwnld_page(url):  # url of page
    """Downloads the OCR text of a newspaper page. Relies on valid
    url from assemble_issue()
    Params: url -> url of OCR text of page: str
    Return: txt -> OCR text of a newspaper page: str"""
    txt_url = get_json(url)['text']
    txt = get_txt(txt_url)
    return txt


def assemble_issue(url):  # url of issue
    """Assembles the OCR text for each page of a newspaper.
    Relies on valid url from dwnld_newspaper()
    Params: url -> url of newspaper issue: str
    Return: txt -> OCR text of all pages in newspaper: str"""
    issue_string = ''
    for page in get_json(url)['pages']:
        issue_string += dwnld_page(page['url'])
    return issue_string  # str 'alltextforallpages'


def parse_date(datestring):
    """Converts YYYY-MM-DD string into date object
    Params: date -> str: 'YYYY-MM-DD'
    Return: return_date -> date"""
    date_fmt_str = '%Y-%m-%d'
    return_date = datetime.strptime(datestring, date_fmt_str).date()
    return return_date


def lccn_to_disk(dir_name, downloaded_issue):
    """Saves a dict of downloaded issues to disk. Creates a directory:
    dir_name
    |--key1.txt
    |--key2.txt
    +--key3.txt
    Params: dir_name -> str: name of created directory for data
            downloaded_issue -> dict: {'YYYY-MM-DD': 'string
                                       of txt'}"""
    if not os.path.exists(dir_name):
        os.makedirs(dir_name)
    for date, text in downloaded_issue.items():
        with open(os.path.join(dir_name, date + '.txt'), 'w') as f:
            f.write(text)
    return
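
For completeness, saving the result of dwnld_newspaper() to disk then looks like this (reusing the variables from the usage sketch in the Design section; the directory name is an arbitrary choice):

issues = dwnld_newspaper(url, start_date, end_date)
lccn_to_disk('sn84026994', issues)  # writes one .txt file per issue date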
asked Nov 14, 2017 at 17:56

1 Answer


Some of the things that may make your code better:

  • you can pass the whole newspaper dict into the format string and use a multi-line string (note the {newspaper[name]} item-access syntax; attribute access like {newspaper.name} would raise an AttributeError on a dict):

    newspaper_string = """
     {newspaper[name]} | Library of Congress No.: {newspaper[lccn]} | {newspaper[place_of_publication]}
     Published from {newspaper[start_year]} to {newspaper[end_year]} by {newspaper[publisher]}
    """.format(newspaper=newspaper_json)
    

    Or, unpack the dictionary into keyword arguments:

    newspaper_string = """
     {name} | Library of Congress No.: {lccn} | {place_of_publication}
     Published from {start_year} to {end_year} by {publisher}
    """.format(**newspaper_json)
    

    Same goes for the issues_string string definition.

  • you can further improve that by using f-strings (Python 3.6+); for example:
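
    A sketch of the same string as an f-string (note the quoting: single quotes for the dict keys inside the double-quoted f-string):

    newspaper_string = (
        f"{newspaper_json['name']} | Library of Congress No.: "
        f"{newspaper_json['lccn']} | {newspaper_json['place_of_publication']}\n"
        f"Published from {newspaper_json['start_year']} to "
        f"{newspaper_json['end_year']} by {newspaper_json['publisher']}")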

  • there is not much reason to use abbreviated function names instead of the full words - for instance, download_newspaper instead of dwnld_newspaper
  • parse_date() is called twice per iteration of the for issue in get_json(url)['issues'] loop. You can call it once and save the result in a variable, for example:
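
    date_issued = parse_date(issue['date_issued'])
    if start_date <= date_issued <= end_date:
        ...

    The chained comparison here also replaces the two-part and condition (date_issued is just a new local name for the saved result).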
  • validate_chronam_url body may be simplified and improved to:

    domain_matches = 'chroniclingamerica.loc.gov/lccn/sn' in url
    is_json = '.json' in url
    return domain_matches and is_json
    
  • you can use json.load(data, ...) instead of json.loads(data.read(), ...) - urlopen() returns a file-like object, which json.load() accepts directly

  • you can use str.join() for the assemble_issue function:

    return ''.join(dwnld_page(page['url']) for page in get_json(url)['pages'])
    
  • you don't need the bare return at the end of the lccn_to_disk function

Some other thoughts:

  • not sure that returning an exception instance when things go wrong is a good idea. Either let the exception propagate or handle it - e.g. log the error and exit
  • consider using requests, which would not only simplify the JSON response parsing (there is a built-in response.json() method) but would also make subsequent requests faster if you use a single session instance - see the sketch after this list
  • it feels like having a Newspaper class might be a good idea. At the least, you would have a nice place to hide the newspaper's string representation inside the __repr__ magic method
  • if performance is a concern then, aside from requests.Session, you may improve JSON parsing speed by switching to the ujson parser
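
Here is a rough sketch combining those last points: get_json() redone with requests and a shared session, plus a minimal Newspaper class (this assumes requests is installed; error handling and URL validation are left out):

import requests

session = requests.Session()  # reusing one session keeps connections alive

def get_json(url):
    response = session.get(url)
    response.raise_for_status()  # raise on HTTP errors instead of printing
    return response.json()      # built-in JSON parsing

class Newspaper:
    def __init__(self, url):
        self.data = get_json(url)

    def __repr__(self):
        return '{} | Library of Congress No.: {}'.format(
            self.data['name'], self.data['lccn'])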
answered Nov 15, 2017 at 1:36
