pdf_table

install

# Install Ghostscript (used for PDF-to-image conversion)
apt install ghostscript
# Install pdftable from source
#pip install pdftable
python setup.py install

Usage

env

To download models from ModelScope, set the environment variable PDFTABLE_USE_MODELSCOPE_HUB to 1. Otherwise, models are downloaded from Hugging Face by default.
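If you drive pdftable from a Python script, the switch described above could be set along these lines. This is a minimal sketch: the variable name and the value 1 come from this README, while the helper `select_model_hub` is a hypothetical name introduced only for illustration. It also assumes the variable has to be set before pdftable is imported or launched, which this README does not state.

```python
import os

# Variable name taken from the README; the value "1" selects ModelScope.
MODELSCOPE_FLAG = "PDFTABLE_USE_MODELSCOPE_HUB"


def select_model_hub(use_modelscope: bool) -> str:
    """Set or clear the flag and return the hub that should then be used.

    Hypothetical helper, not part of pdftable.
    """
    if use_modelscope:
        os.environ[MODELSCOPE_FLAG] = "1"
        return "modelscope"
    # With the variable unset, the README says Hugging Face is the default.
    os.environ.pop(MODELSCOPE_FLAG, None)
    return "huggingface"
```

Calling `select_model_hub(True)` at the top of a script, before anything from pdftable is imported, is probably the safest ordering under that assumption.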

model

The related models have been uploaded to both Hugging Face and ModelScope. When you select a table recognition model, its weights are downloaded to your local machine. For reference, the configuration for all models used in the project is in ocr_table_model_config.py (https://github.com/CycloneBoy/pdf_table/blob/main/src/pdftable/model/ocr_pdf/ocr_table_model_config.py).

cli

# pdftable --help
usage: pdftable [-h] --output_dir OUTPUT_DIR --file_path_or_url FILE_PATH_OR_URL [--lang LANG] [--debug [DEBUG]] [--pages PAGES]
 [--html_page_merge_sep HTML_PAGE_MERGE_SEP] [--detect_model DETECT_MODEL] [--detect_db_thresh DETECT_DB_THRESH]
 [--recognizer_model RECOGNIZER_MODEL] [--recognizer_task_type RECOGNIZER_TASK_TYPE]
 [--table_structure_model TABLE_STRUCTURE_MODEL] [--table_structure_task_type TABLE_STRUCTURE_TASK_TYPE]
 [--layout_model LAYOUT_MODEL]
options:
 -h, --help show this help message and exit
 --output_dir OUTPUT_DIR, --output-dir OUTPUT_DIR
 The output directory (default: None)
 --file_path_or_url FILE_PATH_OR_URL, --file-path-or-url FILE_PATH_OR_URL
 file path or url (default: None)
 --lang LANG ocr recognition language (default: en)
 --debug [DEBUG] debug mode (default: False)
 --pages PAGES need process page. Comma-separated page numbers. Example: '1,3,4' or '1,4-end' or 'all'. (default: all)
 --html_page_merge_sep HTML_PAGE_MERGE_SEP, --html-page-merge-sep HTML_PAGE_MERGE_SEP
 The delimiter that separates each page of PDF conversion results in the final converted html result page.
 (default: @@@@@@)
 --detect_model DETECT_MODEL, --detect-model DETECT_MODEL
 ocr detect model, optional items: PP-OCRv4, PP-OCRv3, resnet18, proxylessnas. (default: PP-OCRv4)
 --detect_db_thresh DETECT_DB_THRESH, --detect-db-thresh DETECT_DB_THRESH
 db threshold (default: 0.2)
 --recognizer_model RECOGNIZER_MODEL, --recognizer-model RECOGNIZER_MODEL
 ocr recognize model, optional items: PP-OCRv4, PP-OCRv3, PP-Table, ConvNextViT, CRNN, LightweightEdge (default:
 PP-OCRv4)
 --recognizer_task_type RECOGNIZER_TASK_TYPE, --recognizer-task-type RECOGNIZER_TASK_TYPE
 ocr recognizer task type, It only takes effect when recognizer_model is ConvNextViT, optional items: general,
 handwritten, document, licenseplate, scene. (default: document)
 --table_structure_model TABLE_STRUCTURE_MODEL, --table-structure-model TABLE_STRUCTURE_MODEL
 table structure model, optional items: CenterNet, SLANet, Lore, Lgpma, MtlTabNet, TableMaster, LineCell. (default:
 Lore)
 --table_structure_task_type TABLE_STRUCTURE_TASK_TYPE, --table-structure-task-type TABLE_STRUCTURE_TASK_TYPE
 table structure task type, optional items: ptn, wtw, wireless, fin. ptn represents the data set as PubTabNet. fin
 represents FinTabNet, which is only valid when the table_structure_model is MtlTabNet. (default: wtw)
 --layout_model LAYOUT_MODEL, --layout-model LAYOUT_MODEL
 layout model, optional items: picodet, DocXLayout (default: picodet)
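As an illustration of the help text above, here is a small sketch that assembles a pdftable command line from Python. Only flags listed in the help output are used. The function name `build_pdftable_command` and the example file names are hypothetical, and the sketch only builds the argument list; it does not run pdftable.

```python
import shlex
from typing import List, Optional


def build_pdftable_command(
    file_path_or_url: str,
    output_dir: str,
    pages: str = "all",
    lang: str = "en",
    table_structure_model: Optional[str] = None,
) -> List[str]:
    """Build an argument list for the pdftable CLI (hypothetical helper)."""
    cmd = [
        "pdftable",
        "--file_path_or_url", file_path_or_url,  # required by the CLI
        "--output_dir", output_dir,              # required by the CLI
        "--pages", pages,                        # e.g. '1,3,4', '1,4-end' or 'all'
        "--lang", lang,
    ]
    if table_structure_model is not None:
        # One of the values listed in the help text, e.g. "Lore" or "SLANet".
        cmd += ["--table_structure_model", table_structure_model]
    return cmd


# Example with made-up paths: show the command as a shell-quoted string.
example = build_pdftable_command("paper.pdf", "./output", pages="1,4-end")
print(shlex.join(example))
```

To actually execute the command, the list can be passed to `subprocess.run(cmd, check=True)`, provided pdftable is installed and on the PATH.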

TODO

Write project documentation
Optimize project code
Add the latest table recognition model
other

Thanks to the following projects

References

If you use pdf_table in your projects, please consider citing the following:

@article{sheng2024pdftable,
  title = {PdfTable: A Unified Toolkit for Deep Learning-Based Table Extraction},
  author = {Sheng, Lei and Xu, Shuai-Shuai},
  journal = {arXiv preprint arXiv:2409.05125},
  url = {https://arxiv.org/abs/2409.05125},
  eprint = {2409.05125},
  doi = {10.48550/arXiv.2409.05125},
  year = {2024}
}
