Prerequisites: Python 3.10
Install Dependencies
Linux/macOS
apt-get/yum/brew install libreoffice
Windows
Install LibreOffice
Append "install_dir\LibreOffice\program" to the PATH environment variable
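Before installing Magic-Doc, it can help to confirm that LibreOffice is actually reachable through PATH. The following is a minimal sketch, not part of Magic-Doc itself: the helper name `find_libreoffice` is hypothetical, and it assumes the LibreOffice launcher is exposed as `soffice` or `libreoffice`, which may differ on your system.

```python
import shutil


def find_libreoffice(candidates=("soffice", "libreoffice")):
    """Return the path of the first candidate executable found on PATH, or None.

    The candidate names are an assumption about how LibreOffice is installed.
    """
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    location = find_libreoffice()
    if location:
        print(f"LibreOffice found at {location}")
    else:
        print("LibreOffice not found on PATH; revisit the dependency steps above.")
```

If the check reports nothing found on Windows, the most likely cause is that the `program` directory has not been appended to PATH, or that the terminal was opened before the change.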
Install Magic-Doc
```bash
pip install fairy-doc[cpu]  # CPU version
# or
pip install fairy-doc[gpu]  # GPU version
```
Magic-Doc is a lightweight open-source tool that converts multiple file types (PPT/PPTX/DOC/DOCX/PDF) to Markdown. It supports both local files and files stored in S3.
```python
# for a local file
from magic_doc.docconv import DocConverter, S3Config

converter = DocConverter(s3_config=None)
markdown_content, time_cost = converter.convert("some_doc.pptx", conv_timeout=300)
```
```python
# for a remote file located in AWS S3
from magic_doc.docconv import DocConverter, S3Config

s3_config = S3Config(ak='${ak}', sk='${sk}', endpoint='${endpoint}')
converter = DocConverter(s3_config=s3_config)
markdown_content, time_cost = converter.convert("s3://some_bucket/some_doc.pptx", conv_timeout=300)
```
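The calls above convert one file at a time. A common follow-up is converting a whole directory, which could be sketched as below. The helper `convert_directory` and the `SUPPORTED_SUFFIXES` set are hypothetical and not part of Magic-Doc; the sketch only assumes that the conversion callable takes a path plus `conv_timeout` and returns a `(markdown, time_cost)` pair, as `converter.convert` does in the examples above.

```python
from pathlib import Path

# Extensions taken from the file types listed in this README.
SUPPORTED_SUFFIXES = {".ppt", ".pptx", ".doc", ".docx", ".pdf"}


def convert_directory(convert, src_dir, out_dir, conv_timeout=300):
    """Convert every supported file in src_dir and write one .md file per input.

    `convert` is any callable shaped like DocConverter.convert: it accepts a
    path string and a conv_timeout keyword, and returns (markdown, time_cost).
    Returns a dict mapping each input file name to its reported time cost.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    results = {}
    for src in sorted(Path(src_dir).iterdir()):
        if not src.is_file() or src.suffix.lower() not in SUPPORTED_SUFFIXES:
            continue
        markdown, time_cost = convert(str(src), conv_timeout=conv_timeout)
        # Keep the original extension in the output name so that
        # report.doc and report.docx do not overwrite each other.
        (out_path / (src.name + ".md")).write_text(markdown, encoding="utf-8")
        results[src.name] = time_cost
    return results
```

With Magic-Doc installed, this would be called as `convert_directory(DocConverter(s3_config=None).convert, "docs", "out")`. Passing the callable in, rather than constructing the converter inside the helper, keeps the sketch independent of how the converter is configured.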
ENV: AMD EPYC 7742 64-Core Processor, NVIDIA A100, CentOS 7
| File Type | Speed (pages/s) |
|---|---|
| PDF (digital) | 347 |
| PDF (OCR) | 2.7 |
| PPT | 20 |
| PPTX | 149 |
| DOC | 600 |
| DOCX | 1482 |
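The throughput figures above can be turned into a rough time estimate for a mixed corpus. The sketch below is illustrative only: the dictionary keys and the `estimate_seconds` helper are hypothetical names, the rates are copied from the table, and the estimate assumes sequential processing on hardware comparable to the benchmark environment.

```python
# Pages per second, copied from the benchmark table above.
PAGES_PER_SECOND = {
    "pdf_digital": 347,
    "pdf_ocr": 2.7,
    "ppt": 20,
    "pptx": 149,
    "doc": 600,
    "docx": 1482,
}


def estimate_seconds(page_counts):
    """Estimate total conversion time in seconds.

    `page_counts` maps a file-type key from PAGES_PER_SECOND to a page count.
    Each type contributes pages / rate; the contributions are summed.
    """
    return sum(pages / PAGES_PER_SECOND[kind] for kind, pages in page_counts.items())


if __name__ == "__main__":
    corpus = {"pdf_ocr": 2700, "docx": 1482}
    print(f"Estimated time: {estimate_seconds(corpus):.0f} s")
```

For this example corpus, the 2,700 scanned PDF pages account for about 1,000 seconds while the 1,482 DOCX pages add about one second, which shows how strongly OCR dominates the total.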
```bibtex
@misc{2024magic-doc,
    title={Magic-Doc: A Toolkit that Converts Multiple File Types to Markdown},
    author={Magic-Doc Contributors},
    howpublished = {\url{https://github.com/InternLM/magic-doc}},
    year={2024}
}
```
This project is released under the Apache 2.0 license.