Description
I recently wrote a Python
script to download files from the Library of Congress (LOC) based on a search query. The code fetches metadata, extracts file URLs of a specific format (e.g., JPEG), and downloads them into a designated directory. I'm seeking feedback on its structure, efficiency, and readability.
Functionality
The script uses the Library of Congress API to:
- Retrieve Item IDs: Query the API for items matching the search URL and parse their IDs.
- Get File URLs: For each item, extract URLs for files of a specified MIME type (e.g., image/jpeg for .jpg).
- Download Files: Save the matching files locally with a logical directory and naming structure.
Key features include retry logic with exponential backoff, rate limiting between requests, paginated result handling, streaming downloads for large files, and unique local filenames.
Code Walkthrough
- Initialization (__init__): Validates inputs like search_url, file_extension, and save_to. Maps the specified file extension to a MIME type using MIME_TYPE_MAP.
- Data Fetching (fetch_json): Sends HTTP requests with retries on failure. Ensures the response is JSON, logging warnings otherwise.
- Item ID Retrieval (get_item_ids): Parses paginated results from the LOC API to collect relevant item IDs. Filters out items that don't match criteria (e.g., excluding collections).
- File URL Extraction (get_image_urls): Iterates through items, gathering file URLs matching the specified MIME type. Implements rate limiting to respect the server.
- File Downloading (download_files): Saves files to a designated directory, ensuring unique filenames. Handles streaming file downloads in chunks for large files.
- Main Workflow (run): Combines the above steps: fetch item IDs, extract file URLs, and download files.
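For reference, a typical invocation looks like the following (the search URL and output directory are placeholders, not values from an actual run):

downloader = LOCImageDownloader(
    search_url="https://www.loc.gov/photos/?q=baseball",  # placeholder search query
    file_extension="jpg",
    save_to="loc_downloads",
)
downloader.run()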
Source Code
import requests
import os
import time
import logging
from tenacity import retry, wait_exponential, stop_after_attempt
from typing import List, Dict, Optional
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s - %(levelname)s - %(message)s"
)
class LOCImageDownloader:
"""
A class to download files from the Library of Congress based on a search query.
Example:
downloader = LOCImageDownloader(search_url, file_extension, save_to)
downloader.run()
"""
MIME_TYPE_MAP = {
"jpg": "image/jpeg",
"jpeg": "image/jpeg",
"tif": "image/tiff",
"tiff": "image/tiff",
"pdf": "application/pdf",
"png": "image/png",
}
def __init__(self, search_url: str, file_extension: str, save_to: str):
"""
Initialize the downloader with search URL, desired file extension, and save path.
Args:
search_url (str): The search URL to query the LOC API.
file_extension (str): Desired file extension (e.g., 'jpg', 'pdf').
save_to (str): Directory to save downloaded files.
Raises:
ValueError: If the specified file extension is unsupported.
"""
if not search_url.startswith("http"):
raise ValueError("Invalid search_url. Must start with http or https.")
if not os.access(os.path.dirname(save_to) or ".", os.W_OK):
raise ValueError(f"Save directory '{save_to}' is not writable.")
self.search_url = search_url
self.file_extension = file_extension.lower()
self.mime_type = self.MIME_TYPE_MAP.get(self.file_extension)
if not self.mime_type:
raise ValueError(f"Unsupported file extension: {self.file_extension}")
self.save_to = save_to
os.makedirs(self.save_to, exist_ok=True)
self.session = requests.Session()
@retry(wait=wait_exponential(multiplier=1, min=2, max=10), stop=stop_after_attempt(5))
def fetch_json(self, url: str, params: Optional[Dict] = None) -> Optional[Dict]:
"""Fetch JSON data from a given URL."""
try:
response = self.session.get(url, params=params, timeout=10)
response.raise_for_status()
if 'application/json' not in response.headers.get("Content-Type", ""):
logging.warning(f"Unexpected Content-Type for {url}: {response.headers.get('Content-Type')}")
return None
return response.json()
except requests.RequestException as e:
logging.error(f"Error fetching {url}: {e}")
return None
def get_item_ids(self) -> List[str]:
"""Retrieve item IDs from the LOC search URL."""
logging.info("Fetching item IDs from the Library of Congress...")
item_ids, url, count = [], self.search_url, 0
max_pages = 100 # Safeguard for pagination loops
pages_processed = 0
while url and pages_processed < max_pages:
data = self.fetch_json(url, {"fo": "json", "c": 100, "at": "results,pagination"})
if not data:
break
results = data.get("results", [])
for result in results:
if "item" in result.get("id", "") and "collection" not in result.get("original_format", []):
item_ids.append(result["id"])
count += len(results)
url = data.get("pagination", {}).get("next") # Get the next page URL
pages_processed += 1
# Be kind to the server (this is a magic number provided by the LOC)
time.sleep(1)
logging.info(f"Found {len(item_ids)} items.")
return item_ids
def get_image_urls(self, item_ids: List[str]) -> List[Dict[str, str]]:
"""Retrieve URLs for files matching the specified file type."""
logging.info("Retrieving file URLs for the specified items...")
file_urls, processed = [], 0
for item_url in item_ids:
data = self.fetch_json(item_url, {"fo": "json"})
processed += 1
logging.info(f"Processing item {processed}/{len(item_ids)}...")
if not data:
continue
for resource in data.get("resources", []):
for file_info in self.flatten_files(resource.get("files", [])):
if file_info.get("mimetype") == self.mime_type:
file_urls.append({"image_url": file_info["url"], "item_id": item_url})
# Be kind to the server (this is a magic number provided by the LOC)
time.sleep(2)
logging.info(f"Found {len(file_urls)} matching files.")
return file_urls
@staticmethod
def flatten_files(files: List) -> List[Dict]:
"""Flatten a potentially nested list of files into a single list of dictionaries."""
if not isinstance(files, list):
raise ValueError("Expected a list for 'files'")
return [item for sublist in files for item in (sublist if isinstance(sublist, list) else [sublist])]
def download_files(self, file_urls: List[Dict[str, str]]) -> None:
"""Download files from the given URLs and save them to the specified directory."""
logging.info("Downloading files...")
for index, file_info in enumerate(file_urls, start=1):
file_url = file_info["image_url"]
item_id = file_info["item_id"].strip("/").split("/")[-1]
save_path = os.path.join(self.save_to, item_id)
os.makedirs(save_path, exist_ok=True)
# Determine the filename
if "image-services/iiif" in file_url:
url_parts = file_url.split("/")
filename_part = next((part.split(":")[-1] for part in url_parts if "service:" in part), "image")
ext = file_url.split(".")[-1]
filename = f"{filename_part}.{ext}"
else:
filename = file_url.split("/")[-1]
# Ensure unique filename
file_path = os.path.join(save_path, filename)
counter = 1
while os.path.exists(file_path):
file_path = os.path.join(save_path, f"{filename.rsplit('.', 1)[0]}_{counter}.{self.file_extension}")
counter += 1
logging.info(f"[{index}/{len(file_urls)}] Downloading {file_url} as {file_path}...")
# Download the file
try:
with self.session.get(file_url, stream=True) as response:
response.raise_for_status()
with open(file_path, "wb") as f:
for chunk in response.iter_content(chunk_size=8192):
f.write(chunk)
except requests.RequestException as e:
logging.error(f"Error downloading {file_url}: {e}")
# Be kind to the server (this is a magic number provided by the LOC)
time.sleep(2)
def run(self) -> None:
"""Execute the downloader: fetch item IDs, retrieve file URLs, and download files."""
item_ids = self.get_item_ids()
if not item_ids:
logging.error("No items found.")
return
file_urls = self.get_image_urls(item_ids)
if not file_urls:
logging.error("No matching files found.")
return
self.download_files(file_urls)
Feedback Areas
The code seems to work both well and as expected. I haven't done much web scraping previously, however, and so I'm looking for general feedback. I'm most concerned about the following:
- Overall Readability: Is the overall flow of the program intuitive and maintainable?
- Error Handling: Are exceptions handled effectively, particularly for network-related errors? Are there additional checks or validations I should include?
- Best Practices: Are there any Pythonic improvements or standard practices I should adopt?
2 Answers
use of LLMs
Best Practices: Are there any ... standard practices I should adopt?
Yes. Adopt the following one.
Stop using AI crutches. They're not helping you. They are actively hindering your learning journey, and your ability to effectively communicate with colleagues. (I don't know which language model you used, so I will call your dog Rover and your model Claude.)
Don't ask Claude to "summarize the functionality of this code" -- it's
not helpful to reviewers. There are plenty of humans waiting in the
wings to insult my intelligence in one way or another, I don't need
another robot to do that for me.
Don't ask Claude to "provide a code walkthrough" -- the code should
be able to eloquently speak for itself.
If some of the methods are not essential to a high level overview,
then they should not be part of the Public API you designed,
and should be marked _private
so reviewers can skip over
them at first reading.
Do show reviewers what typical calling sequence and expected results would be. Which brings us to the next point.
doctest
Thank you for providing a class-level docstring.
class LOCImageDownloader:
"""
A class to download files from the Library of Congress based on a search query.
Example:
downloader = LOCImageDownloader(search_url, file_extension, save_to)
downloader.run()
"""
The one-sentence summary is fine. The rest of it is not.
The central thing you should be advising maintenance engineers of
is the class invariants you promise to preserve.
We see none of that.
Pushing that exposition to the __init__ ctor, even in the form of documenting the meaning of self.this and self.that, would be fine.
But instead we see an example calling sequence: construct it, then run it.
Which suggests that, IDK, fetching and getting should have been private _fetch_json, _get_item_ids, and so on? Elevating run() and demoting the others leaves me with more questions than you've supplied answers for.
These two lines of code would have gone nicely in an automated integration TestCase.
They might also have worked tolerably well in a doctest, though we usually think of those as being small unit tests, without internet interactions. Perhaps it would look like this. (I do not yet know what expected output / results would be.)
>>> downloader = LOCImageDownloader(search_url, file_extension, save_to)
>>> downloader.run()
foo
bar
baz
redundant docstring
If you intend to produce Sphinx documentation for folks who lack access to the source code, be sure to mention that in a posted config file or in the Review Context prose. I'm assuming that is not a part of this project.
def __init__(self, search_url: str, file_extension: str, save_to: str):
"""
Initialize the downloader with search URL, desired file extension, and save path.
That is an accurate, well-formed English sentence.
It didn't tell me anything beyond what I already knew from the signature.
Consider deleting the whole docstring, on the grounds that it is uninformative.
Its sole saving grace is "(e.g., 'jpg', 'pdf')", which helpfully tells callers that sending in e.g. '.jpg' would be Bad, send in 'jpg' instead. This is a slightly surprising convention, in that it ignores the Path.suffix convention of starting with a . dot, but it is a convention and it is clearly documented, thank you.
save_to (str): Directory ...
This is regrettable in at least two ways.
The (str) notation is redundant with the signature and could be DRYed up. And passing in a Path would be far more natural than passing in an arbitrary string. It reveals intent. And it offers a bunch of convenience methods.
Calling out that ValueError could be raised is maybe helpful.
This isn't java, so checked exception paranoia is not appropriate.
As we all know, a Bad Thing could happen at any time, and it's
not helpful to enumerate every single one of them.
The docstring tries to be helpful, but it uses the hopelessly
vague term "unsupported", without mentioning MIME_TYPE_MAP.
That is, the docstring is essentially asking the reader
to "go read the source code, because I'm too lazy to describe
the valid file_extension
values to you."
Omitting the entire note would have left the reader in the
identical situation.
A caller who attempts to pass in e.g. "gif"
will soon learn
the error of their ways, and will mend them, without need of
referring to the docstring.
The code can raise ValueError in more ways than the docstring alludes to.
We're all adults here. You don't need to tell about touching the hot stove. We won't.
Kudos on raising nice errors, with informative diagnostic messages.
And I do like the @retry
decorator.
For one thing, it relieves you of the documentation burden,
placing it upon the package maintainer instead.
unfortunate defaulting
if not os.access(os.path.dirname(save_to) or ".", os.W_OK):
That's just painful.
As mentioned above, the Path API makes many things more pleasant, including this error check. Strongly prefer it over the ancient os.path API.
And the disjunct is a botch on top of the wrong signature choice. We are looking at these possibilities:
- "" or "."
- "." or "."
- "./" or "."
Rather than a disjunct, or unconditionally appending "/", it would be simpler to abandon str in favor of Path.
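A rough sketch of what the ctor check might look like with Path (illustrative only; it creates the directory first, then verifies writability):

from pathlib import Path

save_to = Path(save_to)
save_to.mkdir(parents=True, exist_ok=True)   # create the target directory up front
if not os.access(save_to, os.W_OK):          # then confirm it is actually writable
    raise ValueError(f"Save directory '{save_to}' is not writable.")
self.save_to = save_to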
diagnostic quality
This is perfectly nice code. Feel free to keep it as-is.
self.mime_type = self.MIME_TYPE_MAP.get(self.file_extension)
if not self.mime_type:
raise ValueError(f"Unsupported file extension: {self.file_extension}")
The "unsupported" text could be viewed as helpful, friendly advice about what went wrong, and mentioning the offending extension is a boon to the maintenance engineer who is chasing down a bug.
But we might just as well have gone with this:
self.mime_type = self.MIME_TYPE_MAP[self.file_extension]
The KeyError would be just as fatal, though less diagnostic. But it would helpfully point out that MIME_TYPE_MAP is what defines the notion of "unsupported", something a maintenance engineer would have to go research anyway with the other diagnostic.
ancient annotation
def fetch_json( ... , params: Optional[Dict] = None) -> Optional[Dict]:
Type checking in python has been advancing by leaps and bounds in recent years. Prefer modern annotations:
def fetch_json( ... , params: dict | None = None) -> dict | None:
Kudos on the conventional response.raise_for_status(). But it's unclear why we don't raise upon finding application/json missing from the content type.
swallowed an error
This is terrible:
except requests.RequestException as e:
logging.error(f"Error fetching {url}: {e}")
return None
It looks like pointless checked exception java code.
We had an exception. A perfectly good exception.
Which would have usefully propagated up the call stack,
advising caller that their request was "impossible".
But you swallowed it, returning None instead. Prefer raise, or rather to omit the try so it raises on its own.
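A sketch of the let-it-raise version, assuming callers are prepared to handle (or crash on) requests exceptions:

@retry(wait=wait_exponential(multiplier=1, min=2, max=10), stop=stop_after_attempt(5))
def fetch_json(self, url: str, params: dict | None = None) -> dict:
    """Fetch JSON data from a given URL, letting request errors propagate."""
    response = self.session.get(url, params=params, timeout=10)
    response.raise_for_status()
    return response.json()  # raises if the body is not valid JSON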
inappropriate tuple unpack
Don't do this:
item_ids, url, count = [], self.search_url, 0
Those are three entirely unrelated assignments. They belong on three separate lines.
I don't understand these two lines of code:
max_pages = 100 # Safeguard for pagination loops
...
data = self.fetch_json(url, {"fo": "json", "c": 100, ... })
Was the intent perhaps to write { ..., "c": max_pages, ... }? Or are we seeing a different magic number of 100 there?
I do like this defaulting: results = data.get("results", [])
dead store
We init count to zero, and later we see count += len(results). But then it's never used. Simply delete that local variable.
redundant comment
url = data.get("pagination", {}).get("next") # Get the next page URL
Yeah, the code said that already.
Elide the # comment.
wrong docstring
def get_image_urls(self, item_ids: ... ) -> ... :
"""Retrieve URLs for files matching the specified file type."""
What?!?
I see the specified item IDs. But file type?!? Who wrote this? You? Claude? Maybe the English sentence was once true, and then the code changed? It leaves us with so many questions.
Similarly here:
def download_files(self, file_urls: ... ) -> None:
"""Download files from the given URLs and save them to the specified directory."""
Perhaps once upon a time the signature specified a directory, but no longer.
nit, tuple unpack: file_urls, processed = [], 0
No, just stop it.
This is less worse, as there are only 2 unrelated items instead of 3.
In any event, thank you for citing LoC as the reference for those sleep() magic numbers.
extract helper
# Determine the filename
if ... :
filename = ...
else:
filename = ...
You felt this needed a # comment. Good, I'm glad y'all added one. But that suggests we should break out a helper, perhaps def determine_the_filename(), or some better function name.
The 8192 magic number is not the end of the world, but it would be worthwhile to assign it to some block size symbol.
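One possible shape for both of those suggestions, reusing the existing logic verbatim (the names here are just placeholders):

CHUNK_SIZE = 8192  # bytes per streamed chunk

def _determine_filename(self, file_url: str) -> str:
    """Derive a local filename from the file URL; IIIF URLs get special handling."""
    if "image-services/iiif" in file_url:
        url_parts = file_url.split("/")
        filename_part = next((part.split(":")[-1] for part in url_parts if "service:" in part), "image")
        ext = file_url.split(".")[-1]
        return f"{filename_part}.{ext}"
    return file_url.split("/")[-1]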
docstring restates the code
def run(self) -> None:
"""Execute the downloader: fetch item IDs, retrieve file URLs, and download files."""
item_ids = self.get_item_ids()
...
file_urls = self.get_image_urls(item_ids)
...
self.download_files(file_urls)
If you're going to insult my intelligence in this way, then start with i += 1  # Examines the current value of the index i and increments it by exactly 1.
You might wish to choose a more informative identifier than run.
Let us touch on the Design of Public API item for a moment. A None return value amounts to a fatal error. So perhaps what the caller really wanted was for those two helpers to raise a fatal error upon encountering trouble.
- Reinderien (Nov 19, 2024): If this is based on an LLM, (1) what is the evidence? and (2) the question needs to be banished.
- MD-Tech (Nov 19, 2024): @Reinderien I think the theory is that the "code walkthrough" was LLM generated but not the code itself, although I'd be happy for J_H to correct me.
- J_H (Nov 19, 2024): @Reinderien, I'm willing to believe in the OP's authorship of the source code. But I have high confidence that some sections of the Review Context came straight from an LLM with little or no subsequent edits. To the extent that such prose winds up being part of Project Documentation, I object to burdening future maintainers with it. The OP contained no Makefile, but if the project "needs" LLM text it should be derived from the source artifacts, and re-derived each time those artifacts change. It gives the appearance of offering additional insights, yet it does not.
- IntegerEuler (Nov 19, 2024): Thank you for your extensive comments, they are quite useful. With respect to my use of machine learning models: I did make use of an LLM here. Prior to posting, I ran my code through an LLM and asked for feedback, hoping to catch any glaring errors that I may have glossed over. I then used this feedback in informing the writing of this post, with the goal of improving the clarity of both my code walkthrough and requests with respect to types of feedback sought. The code and post are entirely my own, however, though evidently I may have mimicked the formatting of the LLM's response too closely.
- J_H (Nov 19, 2024): Thank you, @IntegerEuler. I have two bits of advice, and I'll observe that your impulse to offer an overview is a very good one which should be encouraged. (1.) Consider using any of the module-level, class-level, or __init__ docstrings to offer such an overview, with the aim of helping a newly recruited team member quickly come up to speed on how to call into the library and how to interpret its results. Automated tests can really help with this. (2.) Look carefully at your IDE's auto-generated summary of public symbols. Design the Public API. Prune what a new teammate won't need.
Retrying
I was a bit surprised to see tenacity here, since requests can be used along with urllib3 to deliver the desired behavior: Example: Automatic Retries. The example quoted shows some fine-grained control options on status code or method, which could come in handy here. I would prefer this option over a generic library. But I would certainly keep tenacity for unit testing purposes.
Then the retry behavior would cover the other methods as well, not just fetch_json. This would make the whole code more tolerant to transient errors, and therefore more robust, which is the intention here, but the job isn't finished: you are also using session in download_files without the retry functionality.
By the way, you did something right here by using requests.Session(), and not requests.get...
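A minimal sketch of mounting retries on the session in the ctor (the status codes and attempt counts are illustrative, not LOC recommendations):

from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retries = Retry(
    total=5,
    backoff_factor=1,  # exponential backoff between attempts
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["GET"],
)
adapter = HTTPAdapter(max_retries=retries)
self.session = requests.Session()
self.session.mount("https://", adapter)
self.session.mount("http://", adapter)

With this in place, every call that goes through self.session, including the ones in download_files, benefits from the retry policy.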
Dataclasses
The class declaration could be simplified a bit using dataclasses.
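For instance, something along these lines (a sketch only; the validation currently in __init__ would move to __post_init__, and MIME_TYPE_MAP stays a plain class attribute):

from dataclasses import dataclass, field

@dataclass
class LOCImageDownloader:
    search_url: str
    file_extension: str
    save_to: str
    session: requests.Session = field(default_factory=requests.Session)

    def __post_init__(self):
        self.file_extension = self.file_extension.lower()
        self.mime_type = self.MIME_TYPE_MAP.get(self.file_extension)
        if not self.mime_type:
            raise ValueError(f"Unsupported file extension: {self.file_extension}")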
Optimize
Your function get_item_ids
gets a bunch of IDs, and returns at the end of a loop. This could be potentially time- and resource-intensive.
If an error occurs in the middle, then the previously retrieved content is lost, and you would have to start over. It would make sense to return one item at a time, and maybe think about persistence, so that you can later resume where you left.
The computer does not have to remain idle during all this time. Your application can consume the results as they come back, and proceed with other tasks while your scraping goes on in the background, even though the target has limitations and you can't really fire a bunch of parallel requests at it.
I would consider using ThreadPool or other threading-based alternatives to delegate the scraping to a background worker.
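As a rough illustration of the one-item-at-a-time idea mentioned above, get_item_ids could become a generator (a sketch that keeps the original filtering and polite delay):

def iter_item_ids(self):
    """Yield matching item IDs page by page instead of accumulating a list."""
    url = self.search_url
    while url:
        data = self.fetch_json(url, {"fo": "json", "c": 100, "at": "results,pagination"})
        if not data:
            return
        for result in data.get("results", []):
            if "item" in result.get("id", "") and "collection" not in result.get("original_format", []):
                yield result["id"]
        url = data.get("pagination", {}).get("next")
        time.sleep(1)  # keep the polite delay between pages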
Filename
Regarding file name manipulation, using pathlib would simplify the job of extracting file parts (and also make the code more expressive than slicing).
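For example (illustrative; file_url is the same variable used in download_files):

from pathlib import PurePosixPath
from urllib.parse import urlparse

url_path = PurePosixPath(urlparse(file_url).path)
filename = url_path.name     # e.g. "image.jpg"
extension = url_path.suffix  # e.g. ".jpg"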
Logging
It's good to have logging, but do output to a file as well. Debugging console output is tedious, and the buffer console size is limited anyway, so you would probably miss a lot of messages.
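For example, basicConfig accepts a list of handlers, so console and file output can run side by side (the log filename is a placeholder):

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[
        logging.StreamHandler(),                    # keep the console output
        logging.FileHandler("loc_downloader.log"),  # and also write to a file
    ],
)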
Checking response
One last thing: I noticed in fetch_json
that you do this:
if 'application/json' not in response.headers.get("Content-Type", ""):
logging.warning(f"Unexpected Content-Type for {url}: {response.headers.get('Content-Type')}")
return None
The expectation here is that you will be talking HTTP/1.1 with the server. In HTTP2, the headers are lowercase. Then the response would look like this instead:
content-type: application/json; charset=UTF-8
I don't think requests currently supports HTTP/2, so it's probably not an issue for now.
My suggestion is to replace this check by handling requests.exceptions.JSONDecodeError
in this block. This is what will happen "If the response body does not contain valid json."
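A sketch of that approach (recent versions of requests raise requests.exceptions.JSONDecodeError from .json() when the body is not valid JSON):

try:
    return response.json()
except requests.exceptions.JSONDecodeError:
    logging.warning(f"Response from {url} was not valid JSON.")
    return None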