
Background

Chronicling America is an archive of digitized US newspapers hosted by the Library of Congress. The site allows programmatic access to its resources through an API, one part of which is the JSON representation of each newspaper. Newspapers are composed of issues (denoted by date and edition number), and issues are composed of pages, which have several representations. I am interested in the ocr.txt representation.

Each newspaper is represented in the API by URL-linked JSON representations:

newspaper.json -> issue.json -> page.json -> ocr.txt

Example URL sequence to arrive at a single txt page of a newspaper issue:

http://chroniclingamerica.loc.gov/lccn/sn84026994.json
http://chroniclingamerica.loc.gov/lccn/sn84026994/1865-08-21/ed-1.json
http://chroniclingamerica.loc.gov/lccn/sn84026994/1865-08-21/ed-1/seq-1.json
http://chroniclingamerica.loc.gov/lccn/sn84026994/1865-08-21/ed-1/seq-1/ocr.txt

where

  • sn84026994 = newspaper lccn
  • 1865-08-21/ed-1 = issue
  • seq-1 = page
  • ocr.txt = plain text representation of page
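
For illustration, this is roughly what the traversal looks like in code - a minimal sketch with urllib (the 'issues', 'pages', 'url', and 'text' keys are the ones my module below relies on; the [0] indexes simply pick the first issue and its first page):

import json
from urllib.request import urlopen

base = 'http://chroniclingamerica.loc.gov/lccn/sn84026994'
newspaper = json.loads(urlopen(base + '.json').read().decode('utf-8'))
issue = json.loads(urlopen(newspaper['issues'][0]['url']).read().decode('utf-8'))
page = json.loads(urlopen(issue['pages'][0]['url']).read().decode('utf-8'))
ocr_text = urlopen(page['text']).read().decode('utf-8')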

Design

This Python module is designed to traverse the URL links for a given newspaper and a range of issues (selected by date of issue).

The two functions designed to be imported and to interact with calling code are disp_newspaper() and dwnld_newspaper():

  1. disp_newspaper() retrieves some data about the newspaper and prints it to the terminal to assist the user in deciding on arguments for dwnld_newspaper()
  2. dwnld_newspaper() downloads and assembles the ocr.txt files for each desired issue of a given newspaper into a dict {'date': 'text'} (see the usage sketch below)
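
Intended usage looks like this (the date range here matches the example issue above):

from datetime import date

url = 'http://chroniclingamerica.loc.gov/lccn/sn84026994.json'
disp_newspaper(url)  # inspect the available issues first
issues = dwnld_newspaper(url, start_date=date(1865, 8, 21),
                         end_date=date(1865, 9, 1))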

Questions

As I am somewhat of a beginner at this kind of coding:

  1. Is this code easy to understand? If not, how can it be refactored?
  2. Does it follow style guides?
  3. Have I missed a simple strategy for traversing the API endpoints and obtaining the end result of a single string of all .txt files for a given newspaper and issue?

Callable Code

import json
import os
from datetime import datetime
from urllib.request import Request, urlopen
from urllib.error import URLError


def disp_newspaper(url):
    """Displays information and issues available for a given newspaper
    Parameters: url -> url of JSON file for newspaper: str
    Returns: newspaper_json -> dict representation of JSON from http
                               request: dict"""
    try:
        newspaper_json = get_json(url)
    except ValueError as e:
        return e
    newspaper_string = ('{} | Library of Congress No.: {} | {}\nPublished '
                        'from {} to {} by {}').format(
                            newspaper_json['name'],
                            newspaper_json['lccn'],
                            newspaper_json['place_of_publication'],
                            newspaper_json['start_year'],
                            newspaper_json['end_year'],
                            newspaper_json['publisher'])
    issues_string = ('Number of Issues Downloadable: {}\nFirst issue: {}\n'
                     'Last Issue: {}\n').format(
                         len(newspaper_json['issues']),
                         newspaper_json['issues'][0]['date_issued'],
                         newspaper_json['issues'][-1]['date_issued'])
    print(newspaper_string)
    print('\n', end='')
    print(issues_string)


def dwnld_newspaper(url, start_date, end_date):
    """Downloads OCR text of a newspaper from chroniclingamerica.loc.gov by
    parsing the .json representation using the exposed API. Traverses
    the json from the newspaper .json url to each page and composes them into
    a dict of issues where {'date': 'issue text'}
    Params: url -> str: base url of newspaper. Ends in .json
            start_date -> date: date(year, month, day)
                          represents the first issue to download
            end_date -> date: date(year, month, day)
                        represents the last issue to download
    Return: newspaper_issues -> dict: {'date': 'issue text'}"""
    newspaper_issues = {}
    # Terminal UI Print statements
    print('start date:', start_date)
    print('end date:', end_date)
    # Interface
    print('Getting issues:')
    # TODO: handle more than 2 issue editions for same date
    try:
        for issue in get_json(url)['issues']:
            if (parse_date(issue['date_issued']) >= start_date and
                    parse_date(issue['date_issued']) <= end_date):
                # Check for multiple issues with same date
                if issue['date_issued'] not in newspaper_issues:
                    print(issue['date_issued'])
                    newspaper_issues[issue['date_issued']] = \
                        assemble_issue(issue['url'])
                # append to differentiate second edition of same date
                else:
                    print(issue['date_issued'] + '-ed-2')
                    newspaper_issues[issue['date_issued'] + '-ed-2'] = \
                        assemble_issue(issue['url'])
        return newspaper_issues  # dict {'date_issued': 'alltextforallpages'}
    except ValueError as e:
        return e

Supporting Functions

The above two functions rely on several helper functions that modularize the code.

def validate_chronam_url(url):
    """Naive check. Ensures that the url goes to a
    chroniclingamerica.loc.gov newspaper
    and references the .json representation
    Params: url -> url of JSON file for newspaper to download: str
    Return: Boolean"""
    domain_chk = 'chroniclingamerica.loc.gov/lccn/sn'
    json_chk = '.json'
    if domain_chk in url and json_chk in url:
        return True
    else:
        return False


def get_json(url):
    """Downloads json from url from chroniclingamerica.loc.gov
    and saves as a Python dict.
    Parameters: url -> url of JSON file for newspaper
                to download: str
    Returns: json_dict -> dict representation of
             JSON from http request: dict"""
    r = Request(url)
    # Catch non-chronam urls
    if validate_chronam_url(url) is not True:
        raise ValueError('Invalid url for chroniclingamerica.loc.gov '
                         'OCR newspaper (url must end in .json)')
    try:
        data = urlopen(r)
    except URLError as e:
        if hasattr(e, 'reason'):
            print('We failed to reach a server.')
            print('Reason: ', e.reason)
            print('url: ', url)
        elif hasattr(e, 'code'):
            print('The server couldn\'t fulfill the request.')
            print('Error code: ', e.code)
            print('url: ', url)
    else:
        # read().decode('utf-8') is necessary for Python 3.4
        json_dict = json.loads(data.read(), encoding='utf-8')
        return json_dict


def get_txt(url):
    """Downloads txt from url from chroniclingamerica.loc.gov
    and saves as python str.
    Relies on valid url supplied by get_json()
    Parameters: url -> url for OCR text returned by get_json(): str
    Returns: retrieved_txt -> OCR text: str"""
    # TODO: return lists of missing & failed pages
    missing_pages = []
    failed_pages = []
    r = Request(url)
    try:
        data = urlopen(r)
    except URLError as e:
        if hasattr(e, 'reason'):
            print('We failed to reach a server.')
            print('Reason: ', e.reason)
            print('url: ', url)
            retrieved_txt = ('Likely Missing Page: Not digitized, '
                             'published')
            missing_pages.append(url)
        elif hasattr(e, 'code'):
            print('The server couldn\'t fulfill the request.')
            print('Error code: ', e.code)
            print('url: ', url)
            retrieved_txt = 'Server didn\'t return any text'
            failed_pages.append(url)
    else:
        retrieved_txt = data.read().decode('utf-8')
    return retrieved_txt


def dwnld_page(url):  # url of page
    """Downloads the OCR text of a newspaper page. Relies on valid
    url from assemble_issue()
    Params: url -> url of OCR text of page: str
    Return: txt -> OCR text of a newspaper page: str"""
    txt_url = get_json(url)['text']
    txt = get_txt(txt_url)
    return txt


def assemble_issue(url):  # url of issue
    """Assembles the OCR text for each page of a newspaper.
    Relies on valid url from dwnld_newspaper()
    Params: url -> url of newspaper issue: str
    Return: txt -> OCR text of all pages in newspaper: str"""
    issue_string = ''
    for page in get_json(url)['pages']:
        issue_string += dwnld_page(page['url'])
    return issue_string  # str 'alltextforallpages'


def parse_date(datestring):
    """Converts YYYY-MM-DD string into date object
    Params: date -> str: 'YYYY-MM-DD'
    Return: return_date -> date"""
    date_fmt_str = '%Y-%m-%d'
    return_date = datetime.strptime(datestring, date_fmt_str).date()
    return return_date


def lccn_to_disk(dir_name, downloaded_issue):
    """Saves a dict of downloaded issues to disk. Creates a directory:
    dir_name
    |--key1.txt
    |--key2.txt
    +--key3.txt
    Params: dir_name -> str: name of created directory for data
            downloaded_issue -> dict: {'YYYY-MM-DD': 'string
                                       of txt'}"""
    if not os.path.exists(dir_name):
        os.makedirs(dir_name)
    for date, text in downloaded_issue.items():
        with open(os.path.join(dir_name, date + '.txt'), 'w') as f:
            f.write(text)
    return
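
For completeness, saving the result of dwnld_newspaper() to disk then looks like this (reusing the variables from the usage sketch in the Design section; the directory name is an arbitrary choice):

issues = dwnld_newspaper(url, start_date, end_date)
lccn_to_disk('sn84026994', issues)  # writes one .txt file per issue date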
asked Nov 14, 2017 at 17:56

1 Answer


Some of the things that may make your code better:

  • you can pass the whole newspaper dict into the format string and use a multi-line string (note the {newspaper[name]} item-access syntax; attribute access like {newspaper.name} would raise an AttributeError on a dict):

    newspaper_string = """
     {newspaper[name]} | Library of Congress No.: {newspaper[lccn]} | {newspaper[place_of_publication]}
     Published from {newspaper[start_year]} to {newspaper[end_year]} by {newspaper[publisher]}
    """.format(newspaper=newspaper_json)
    

    Or, unpack the dictionary into keyword arguments:

    newspaper_string = """
     {name} | Library of Congress No.: {lccn} | {place_of_publication}
     Published from {start_year} to {end_year} by {publisher}
    """.format(**newspaper_json)
    

    Same goes for the issues_string string definition.

  • you can further improve that by using f-strings (Python 3.6+); for example:
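
    A sketch of the same string as an f-string (note the quoting: single quotes for the dict keys inside the double-quoted f-string):

    newspaper_string = (
        f"{newspaper_json['name']} | Library of Congress No.: "
        f"{newspaper_json['lccn']} | {newspaper_json['place_of_publication']}\n"
        f"Published from {newspaper_json['start_year']} to "
        f"{newspaper_json['end_year']} by {newspaper_json['publisher']}")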

  • there is not much reason to use abbreviated function names instead of the full words - for instance, download_newspaper instead of dwnld_newspaper
  • parse_date() is called twice per iteration of the for issue in get_json(url)['issues'] loop. You can call it once and save the result in a variable, for example:
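
    date_issued = parse_date(issue['date_issued'])
    if start_date <= date_issued <= end_date:
        ...

    The chained comparison here also replaces the two-part and condition (date_issued is just a new local name for the saved result).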
  • validate_chronam_url body may be simplified and improved to:

    domain_matches = 'chroniclingamerica.loc.gov/lccn/sn' in url
    is_json = '.json' in url
    return domain_matches and is_json
    
  • you can use json.load(data, ...) instead of json.loads(data.read(), ...) - urlopen() returns a file-like object, which json.load() accepts directly

  • you can use str.join() for the assemble_issue function:

    return ''.join(dwnld_page(page['url']) for page in get_json(url)['pages'])
    
  • you don't need the bare return at the end of the lccn_to_disk function

Some other thoughts:

  • not sure that returning an exception instance when things go wrong is a good idea. Either let the exception propagate or handle it - e.g. log the error and exit
  • consider using requests, which would not only simplify the JSON response parsing (there is a built-in response.json() method) but would also make subsequent requests faster if you use a single session instance - see the sketch after this list
  • it feels like having a Newspaper class might be a good idea. At the least, you would have a nice place to hide the newspaper's string representation inside the __repr__ magic method
  • if performance is a concern then, aside from requests.Session, you may improve JSON parsing speed by switching to the ujson parser
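
Here is a rough sketch combining those last points: get_json() redone with requests and a shared session, plus a minimal Newspaper class (this assumes requests is installed; error handling and URL validation are left out):

import requests

session = requests.Session()  # reusing one session keeps connections alive

def get_json(url):
    response = session.get(url)
    response.raise_for_status()  # raise on HTTP errors instead of printing
    return response.json()      # built-in JSON parsing

class Newspaper:
    def __init__(self, url):
        self.data = get_json(url)

    def __repr__(self):
        return '{} | Library of Congress No.: {}'.format(
            self.data['name'], self.data['lccn'])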
answered Nov 15, 2017 at 1:36
