Add PDF Image Extractor script with README documentation #500
Pull request overview
This PR introduces a new utility script that recursively extracts all embedded images from PDF files in a directory tree. The script uses PyMuPDF (fitz) to process PDFs and supports optional deduplication of images per document.
Key changes:
- Adds `pdf_image_extractor.py` with a command-line interface for PDF image extraction
- Includes a comprehensive README with usage examples and documentation
- Supports a customizable output directory and per-PDF deduplication options
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| PDF Image Extractor/pdf_image_extractor.py | Main script implementing recursive PDF scanning and image extraction logic with CLI argument parsing |
| PDF Image Extractor/README.md | Documentation covering requirements, usage, CLI options, and output structure |
Code Quality Observations:
The implementation is generally well-structured with clear separation of concerns. However, there are a few technical issues to address:
- Potential crash with `os.path.commonpath()` (lines 14-16): The code uses `os.path.commonpath([pdf_path, output_root])`, which can raise a `ValueError` on Windows when paths are on different drives, or when they don't share a common ancestor. This would crash the script in common scenarios where users specify an output directory on a different drive. The logic appears intended to mirror the directory structure, but using the common path as the base is problematic. A simpler approach would be to calculate the relative path from the input `pdf_dir` directory.
- Inefficient directory creation logic (lines 35-36): The condition `if img_count == 0 and not os.path.exists(output_folder)` only creates the directory before writing the first image. While `os.makedirs()` is called with `exist_ok=True`, the double-check is redundant. It would be clearer to create the directory once before the loop if there are images to extract.
- Redundant deduplication checks (lines 27-30): The code checks `if dedup` twice, once to skip duplicates and again to add to the set. This could be simplified to a single conditional block.
- Missing `requirements.txt`: Several other projects in this repository include a `requirements.txt` file (e.g., PDF Merger, Image Watermarker, Image to ASCII). Adding one for this project would improve consistency and make dependency installation clearer for users.
- Missing error handling for image extraction: If `doc.extract_image(xref)` fails (line 32), the script will crash. While PyMuPDF is generally robust, adding a try-except block would make the script more resilient.
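The code observations above could be addressed along the following lines. This is only a sketch, not the PR's actual code: the function names `mirrored_output_folder` and `extract_images`, the `image_N.ext` file naming, and the use of the xref as the deduplication key are all assumptions made for illustration. The `doc` argument is assumed to behave like a PyMuPDF document, whose `extract_image(xref)` returns a dict containing `ext` and `image` entries.

```python
import os


def mirrored_output_folder(pdf_path, pdf_dir, output_root):
    """Return the output folder for one PDF, mirroring its place under pdf_dir.

    Hypothetical helper. It takes the path relative to the input root instead
    of calling os.path.commonpath() on the PDF path and the output root, so
    the output root may live on a different drive.
    """
    rel = os.path.relpath(pdf_path, start=pdf_dir)
    rel_no_ext = os.path.splitext(rel)[0]  # drop the ".pdf" suffix
    return os.path.join(output_root, rel_no_ext)


def extract_images(doc, xrefs, output_folder, dedup=False):
    """Write the images behind the given xrefs and return the saved paths.

    Hypothetical helper. `doc` only needs an extract_image(xref) method.
    """
    if not xrefs:
        return []

    # Create the folder once, before the loop, instead of checking per image.
    os.makedirs(output_folder, exist_ok=True)

    seen = set()
    saved = []
    for xref in xrefs:
        # Single dedup block: skip repeats and record new xrefs together.
        # Assumption: the xref is the deduplication key.
        if dedup:
            if xref in seen:
                continue
            seen.add(xref)

        # One unreadable image should not abort the whole run.
        try:
            info = doc.extract_image(xref)
        except Exception as exc:
            print(f"Skipping xref {xref}: {exc}")
            continue
        if not info:
            continue

        # Assumed naming scheme: image_1.png, image_2.jpeg, ...
        name = f"image_{len(saved) + 1}.{info['ext']}"
        path = os.path.join(output_folder, name)
        with open(path, "wb") as fh:
            fh.write(info["image"])
        saved.append(path)
    return saved
```

Note that `os.path.relpath()` can itself raise `ValueError` on Windows for paths on different drives, but here both arguments sit under the input root, so the output directory's drive no longer matters.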
Documentation:
The README is well-written with clear examples and appropriate detail. The structure follows good practices with separate sections for requirements, usage, examples, and output structure.
This PR adds a new script, PDF Image Extractor, which recursively scans a directory tree for PDF files and extracts all embedded images from each document.
`PDF` within the input root directory by default (customizable via `--out`).
`--dedup` flag to enable per-PDF deduplication of images.
Additional notes: