pymupdf4llm for multi-page table #3954

Answered by JorjMcKie
bjmvercelli asked this question in Q&A

Hello, I have been trying to find a PDF parsing tool that handles tables that start and end on different pages, without redeclaring the columns.

Example: (image not included)

Does pymupdf4llm handle this scenario?

No, we don't. This request goes beyond syntactical extraction logic: we currently produce MD text page by page.
There is no effort to detect things that cross page boundaries. This applies not only to tables but also to, e.g., text paragraphs.
Detecting that a table on some page actually continues the last table on the previous page would turn the existing (page-wise) logic on its head. In addition, if a table has no header row, how would we even ensure that it continues an earlier table?

  • Number of columns? No safe indicator!
  • Equal column widths in addition? Still not safe. What is more, a table with the same column count but different column widths may still be a continuation.
  • That leaves checking that each column holds the same data types as in the previous table ... sorry, this is simply beyond any reasonable scope.
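
To make the first and last checks concrete, here is a rough sketch of what such a heuristic could look like. It is not part of pymupdf4llm: the function name `looks_like_continuation` is hypothetical, and each table is assumed to be a plain list of rows of cell strings. It compares the column count and a crude per-column data type of the rows on either side of the page break.

```python
def looks_like_continuation(prev_rows, next_rows):
    """Heuristic only: True if next_rows *might* continue prev_rows.

    Both arguments are lists of rows; each row is a list of cell strings
    (or None for empty cells). This is an assumed input shape.
    """

    def kind(cell):
        # Crude type guess for one cell: "empty", "number" or "text".
        text = (cell or "").strip()
        if not text:
            return "empty"
        try:
            float(text.replace(",", ""))
            return "number"
        except ValueError:
            return "text"

    if not prev_rows or not next_rows:
        return False
    last, first = prev_rows[-1], next_rows[0]
    # Check 1: same number of columns.
    if len(last) != len(first):
        return False
    # Check 2: same kind of data in every column (empty cells match anything).
    return all(
        "empty" in (kind(a), kind(b)) or kind(a) == kind(b)
        for a, b in zip(last, first)
    )
```

Two unrelated tables that both happen to have a text column followed by two numeric columns pass this test just as well as a genuine continuation does, which is exactly why none of these indicators is safe.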

If you know that your tables are continuations, you can still join them by exporting each one to a pandas DataFrame and then using pandas to join them. There is an example script in the utilities repo.
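
A minimal sketch of that join follows. It is not the script from the utilities repo. It assumes you already hold one DataFrame per page (in PyMuPDF these would typically come from `page.find_tables()` followed by `Table.to_pandas()`), and that every continuation table has the same number of columns as the first one. The function name `join_continuations` and the `labels_are_data` flag are made up for this illustration. The flag covers the case where a continuation page has no header row, so its first data row ended up as the column labels.

```python
import pandas as pd


def join_continuations(first, continuations, labels_are_data=True):
    """Stack continuation tables under `first`, reusing its column names."""
    parts = [first]
    for frame in continuations:
        if frame.shape[1] != first.shape[1]:
            raise ValueError("column count differs; not a plain continuation")
        body = frame.copy()
        body.columns = first.columns
        if labels_are_data:
            # Assumption: the continuation had no header row, so its column
            # labels are really its first data row. Put that row back.
            recovered = pd.DataFrame([list(frame.columns)], columns=first.columns)
            body = pd.concat([recovered, body], ignore_index=True)
        parts.append(body)
    return pd.concat(parts, ignore_index=True)
```

For example, if page 1 yields a frame with columns `item` and `qty`, and page 2 yields a frame whose "columns" are really a data row, `join_continuations(page1, [page2])` returns one frame with the `item`/`qty` header and all rows in page order. Pass `labels_are_data=False` when the continuation repeats the real header.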


@JorjMcKie
I’m currently developing a program that uses AI to extract key information from PDF documents. My approach is to first parse the PDF into a Markdown file and then send it to the AI for further processing. However, while using your pymupdf4llm tool, I’ve noticed that it doesn’t fully extract all the text from the PDFs. For example, when tables span multiple pages, the tool fails to extract the complete table data. That said, even if it can’t handle such complex tables perfectly, I would expect it to at least return all the textual content available in the document.

Could you provide any guidance or suggestions on how to improve this aspect? Below is a snippet of the code I'm using:

    import json
    import os

    import pymupdf4llm

    # `client` and `system_prompt` are assumed to be defined elsewhere.
    md_content = pymupdf4llm.to_markdown(self.pdf_file_path)
    pdf_file_name = os.path.splitext(os.path.basename(self.pdf_file_path))[0]
    md_file_path = f'{pdf_file_name}.md'
    # Save the Markdown so it can be inspected before it goes to the model.
    with open(md_file_path, 'w', encoding='utf-8') as md_file:
        md_file.write(md_content)
    user_prompt = f"Here is the Markdown content containing order details:\n{md_content}"
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=messages,
        response_format={'type': 'json_object'},
        max_tokens=6666,
    )
    result = json.loads(response.choices[0].message.content)

Answer selected by bjmvercelli

@cobaltautomationdev please always provide reproducible data!
If content is recognized neither as table content nor as other text, then this is a bug and should be reported as such in a regular bug issue.

In your case, please use the issue tab of PyMuPDF4LLM, https://github.com/pymupdf/RAG/issues
