Single430/pdf_table_parse

pdf_table_parse

Extracts tables from PDF files, built on top of pdf2htmlEX.

  • pdf2htmlEX
  • This program was written a long time ago and still has many shortcomings. I hope more people will help improve it as open source.

requirements

tornado
beautifulsoup4
numpy
Pillow
ztools # optional, can be removed

docker

$ docker images
bwits/pdf2htmlex latest

run server

python pdf_to_html_to_table_server.py
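
Before sending requests it can help to confirm that the server is actually listening. The following is a minimal sketch, assuming the server binds to port 13131 on localhost (the port used in the test request in this README); the helper name `wait_for_port` is hypothetical and not part of this repository.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.2)
    return False
```

For example, `wait_for_port("127.0.0.1", 13131, timeout=30.0)` returns `True` once the server accepts connections, and `False` if it never comes up within 30 seconds.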

test

import base64
import json

import requests

file_name = 'H2_AN202001131373938984_1.pdf'

# Read the PDF and send it base64-encoded as a form field.
with open(f"source/{file_name}", 'rb') as fIo:
    data = {
        'pdf': base64.b64encode(fIo.read()),
        'startPage': 0,
        'endPage': 10,
        'pdfName': file_name
    }

resp = requests.post('http://127.0.0.1:13131/parser/pdf2table', data=data)
result = resp.json()
print(json.dumps(result, ensure_ascii=False, indent=4))

# Save the combined tables HTML next to the source PDF.
with open(f'source/table_{file_name}.html', 'w', encoding='utf-8') as fileIo:
    fileIo.write(result['all_page_tables_html'])

{
 "all_page_tables_html": "",
 "all_table": [],
 "pdf_name": "H2_AN202001131373938984_1.pdf",
 "code": 200,
 "message": "success"
}
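
The response can be checked before anything is written to disk. The following is a minimal sketch, assuming that `code == 200` signals success as in the sample response above; the helper name `extract_tables_html` is hypothetical and not part of this repository, and the values the server uses for failures are not documented here.

```python
def extract_tables_html(payload: dict) -> str:
    """Return the combined tables HTML, or raise if the server reported a failure."""
    code = payload.get("code")
    if code != 200:
        # Assumption: any code other than 200 means the parse did not succeed.
        message = payload.get("message")
        raise RuntimeError(f"pdf2table failed: code={code}, message={message!r}")
    return payload.get("all_page_tables_html", "")


# The sample response from this README, used as a stand-in for resp.json().
sample = {
    "all_page_tables_html": "",
    "all_table": [],
    "pdf_name": "H2_AN202001131373938984_1.pdf",
    "code": 200,
    "message": "success",
}
html = extract_tables_html(sample)
```

In the test script, `extract_tables_html(resp.json())` could replace the direct lookup of `all_page_tables_html`, so a failed parse raises an error instead of silently writing an empty file.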

Input: H2_AN202001131373938984_1.pdf
Output: table_H2_AN202001131373938984_1.pdf.html
