pymupdf4llm for multi-page table · pymupdf/PyMuPDF · Discussion #3954

bjmvercelli
Oct 16, 2024

Hello, I've been trying to find a PDF parser that handles tables that start and end on different pages without redeclaring the columns.

Example:
image

Does pymupdf4llm handle this scenario?

Answered by JorjMcKie

Oct 17, 2024

No, we don't. This request exceeds syntactical extraction logic. We currently produce MD text page by page.
There is no effort to detect things that cross multiple pages. This applies not only to tables but also to, e.g., text paragraphs.
Detecting that a table on some page actually continues the last table on the previous page would turn the existing (page-wise) logic on its head. In addition, if a table has no header row, how would we even establish that it continues an earlier table?


Replies: 2 comments 1 reply

JorjMcKie
Oct 17, 2024
Maintainer

Number of columns? No safe indicator!
Equal column widths in addition? Still not safe. What is more, a table with the same column count but different column widths may still be a continuation.
That leaves checking whether each column holds the same data type as in the previous table ... sorry: this is simply beyond any reasonable scope.
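To make the signals listed above concrete, here is a minimal sketch of a check based on column count and per-column data type. It is not part of PyMuPDF or pymupdf4llm; the names `cell_kind`, `column_kinds`, and `may_continue` are hypothetical, and tables are assumed to be plain lists of rows of strings. As the answer stresses, a positive result is only a hint, never proof of a continuation.

```python
def cell_kind(cell):
    """Classify one cell as 'empty', 'number', or 'text'."""
    if cell is None or not str(cell).strip():
        return "empty"
    text = str(cell).strip().replace(",", "")  # tolerate thousands separators
    try:
        float(text)
        return "number"
    except ValueError:
        return "text"


def column_kinds(rows):
    """Return the dominant non-empty cell kind for each column."""
    n_cols = max(len(row) for row in rows)
    kinds = []
    for col in range(n_cols):
        counts = {"number": 0, "text": 0}
        for row in rows:
            kind = cell_kind(row[col]) if col < len(row) else "empty"
            if kind != "empty":
                counts[kind] += 1
        if counts["number"] == 0 and counts["text"] == 0:
            kinds.append("empty")
        else:
            kinds.append("number" if counts["number"] > counts["text"] else "text")
    return kinds


def may_continue(prev_rows, next_rows):
    """Heuristic only: same column count and compatible column kinds."""
    if not prev_rows or not next_rows:
        return False
    prev_kinds = column_kinds(prev_rows)
    next_kinds = column_kinds(next_rows)
    if len(prev_kinds) != len(next_kinds):
        return False
    # An all-empty column is treated as compatible with anything.
    return all(a == b or "empty" in (a, b) for a, b in zip(prev_kinds, next_kinds))
```

Two unrelated tables with the same column layout pass this check, and a genuine continuation whose columns were detected differently fails it, which is the reason given above for leaving this out of the library.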

If you know that your tables are continuations, you can still join them by exporting each one to a pandas DataFrame and then using pandas to join them. There is an example script in the utilities repo.
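The join suggested above might look roughly like the following. This is a sketch, not the script from the utilities repo: the helper names `tables_as_frames` and `join_continuations` are hypothetical, and it assumes a recent PyMuPDF that can be imported as `pymupdf` (older releases use `import fitz`). It builds each DataFrame from `Table.extract()` so that the first row of a header-less continuation is kept as data and not consumed as column names.

```python
import pandas as pd


def tables_as_frames(pdf_path):
    """Return one DataFrame per detected table, in page order."""
    import pymupdf  # imported lazily so the join helper works without it

    frames = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            for table in page.find_tables().tables:
                # extract() yields the cell texts as a list of rows
                frames.append(pd.DataFrame(table.extract()))
    return frames


def join_continuations(frames, first_row_is_header=True):
    """Stack frames that are known to be pieces of one table."""
    if not frames:
        return pd.DataFrame()
    widths = {frame.shape[1] for frame in frames}
    if len(widths) != 1:
        raise ValueError(f"column counts differ: {sorted(widths)}")
    joined = pd.concat(frames, ignore_index=True)
    if first_row_is_header:
        # Promote the first row of the first piece to column names.
        joined.columns = [str(name) for name in joined.iloc[0]]
        joined = joined.iloc[1:].reset_index(drop=True)
    return joined
```

Deciding which frames belong together is left to the caller, for example `join_continuations(tables_as_frames("orders.pdf")[:2])` for a table known to span the first two detected pieces (the file name is a placeholder).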

1 reply

cobaltautomationdev Apr 24, 2025

@JorjMcKie
I’m currently developing a program that uses AI to extract key information from PDF documents. My approach is to first parse the PDF into a Markdown file and then send it to the AI for further processing. However, while using your pymupdf4llm tool, I’ve noticed that it doesn’t fully extract all the text from the PDFs. For example, when tables span multiple pages, the tool fails to extract the complete table data. That said, even if it can’t handle such complex tables perfectly, I would expect it to at least return all the textual content available in the document.

Could you provide any guidance or suggestions on how to improve this aspect? Below is a snippet of the code I'm using:

    import json
    import os

    import pymupdf4llm

    # `self.pdf_file_path`, `system_prompt`, and `client` are defined elsewhere.
    md_content = pymupdf4llm.to_markdown(self.pdf_file_path)

    # Save the Markdown in the current working directory, named after the PDF.
    pdf_file_name = os.path.splitext(os.path.basename(self.pdf_file_path))[0]
    md_file_path = f'{pdf_file_name}.md'
    with open(md_file_path, 'w', encoding='utf-8') as md_file:
        md_file.write(md_content)

    # Send the Markdown to the model and parse its JSON reply.
    user_prompt = f"Here is the Markdown content containing order details:\n{md_content}"
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=messages,
        response_format={'type': 'json_object'},
        max_tokens=6666,
    )
    result = json.loads(response.choices[0].message.content)

Answer selected by bjmvercelli

JorjMcKie
Apr 24, 2025
Maintainer

@cobaltautomationdev please always provide reproducible data!
If content is recognized neither as table content nor as other text, then this is a bug and should be reported as such in a regular bug issue.

In your case, please use the issue tab of PyMuPDF4LLM, https://github.com/pymupdf/RAG/issues
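One way to gather reproducible evidence for such a report is to compare, page by page, the words in PyMuPDF's plain-text extraction with the words in the Markdown output. The sketch below is only an illustration: `missing_words` and `pages_with_missing_text` are hypothetical helper names, and it assumes that `pymupdf4llm.to_markdown(..., page_chunks=True)` returns one dictionary per page, in page order, with the page's Markdown under the `"text"` key.

```python
import re

WORD = re.compile(r"\w+")


def missing_words(raw_text, markdown_text):
    """List words found in raw_text but absent from markdown_text."""
    md_words = {word.lower() for word in WORD.findall(markdown_text)}
    seen = set()
    missing = []
    for word in WORD.findall(raw_text):
        key = word.lower()
        if key not in md_words and key not in seen:
            seen.add(key)
            missing.append(word)
    return missing


def pages_with_missing_text(pdf_path):
    """Map 1-based page numbers to words missing from the Markdown."""
    import pymupdf  # imported lazily; only needed for real PDFs
    import pymupdf4llm

    chunks = pymupdf4llm.to_markdown(pdf_path, page_chunks=True)
    report = {}
    with pymupdf.open(pdf_path) as doc:
        for page, chunk in zip(doc, chunks):
            missing = missing_words(page.get_text(), chunk["text"])
            if missing:
                report[page.number + 1] = missing
    return report
```

A non-empty result for some page, together with the PDF itself, gives the maintainers a concrete starting point. The comparison is word-based, so it ignores ordering and formatting differences between the two outputs.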

0 replies
