pymupdf4llm for multi-page table #3954
Hello, I've been trying to find a PDF parsing tool that handles tables that start and end on different pages, without the columns being redeclared.
Example:
[image]
Does pymupdf4llm handle this scenario?
No, we don't. This request exceeds syntactical extraction logic. We currently produce MD text page by page.
There is no effort to detect things that cross page boundaries. This applies not only to tables but also, e.g., to text paragraphs.
Detecting that a table on some page actually continues the last table on the previous page would turn the existing (page-wise) logic on its head. In addition, if a table has no header row, how would we even ensure that it continues an earlier table?
- Number of columns? No safe indicator!
- Equal column widths in addition? Still not safe. What is more: a table with the same column count but different column widths may still be a continuation.
- That leaves checking whether each column has the same data types as in the previous table ... sorry: this is simply beyond any reasonable scope.
If you know that your tables are continuations, you can still join them by exporting each one to a pandas DataFrame and then joining them with pandas. There is an example script in the utilities repo.
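The following is only a minimal sketch of that manual join, not the script from the utilities repo. It assumes you already know which fragments belong together. The helper names `frames_from_pdf` and `join_table_fragments` and the `header_is_data` switch are made up for this illustration. `page.find_tables()` and `Table.to_pandas()` are PyMuPDF calls, and older PyMuPDF releases are imported as `fitz` rather than `pymupdf`.

```python
import pandas as pd


def frames_from_pdf(path):
    """Collect every table PyMuPDF detects, in page order, as DataFrames."""
    import pymupdf  # imported lazily; older releases use `import fitz`

    frames = []
    with pymupdf.open(path) as doc:
        for page in doc:
            for table in page.find_tables().tables:
                frames.append(table.to_pandas())
    return frames


def join_table_fragments(frames, header_is_data=False):
    """Stack fragments that are KNOWN to continue the first one.

    Columns are matched by position and take the names of the first
    fragment. Set header_is_data=True (an assumption about your data) if
    the column labels of the later fragments are really their first data
    row, so that row is put back instead of being dropped.
    """
    if not frames:
        return pd.DataFrame()
    columns = list(frames[0].columns)
    parts = [frames[0]]
    for frame in frames[1:]:
        if frame.shape[1] != len(columns):
            raise ValueError("column count differs; refusing to join")
        rows = frame.values.tolist()
        if header_is_data:
            rows.insert(0, list(frame.columns))
        parts.append(pd.DataFrame(rows, columns=columns))
    return pd.concat(parts, ignore_index=True)
```

Typical use would be `join_table_fragments(frames_from_pdf("input.pdf")[0:2])` for a table split over two pages. The function deliberately makes no attempt to decide whether a fragment is a continuation; that decision stays with you, for the reasons given above.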
@JorjMcKie
I’m currently developing a program that uses AI to extract key information from PDF documents. My approach is to first parse the PDF into a Markdown file and then send it to the AI for further processing. However, while using your pymupdf4llm tool, I’ve noticed that it doesn’t fully extract all the text from the PDFs. For example, when tables span multiple pages, the tool fails to extract the complete table data. That said, even if it can’t handle such complex tables perfectly, I would expect it to at least return all the textual content available in the document.
Could you provide any guidance or suggestions on how to improve this aspect? Below is a snippet of the code I'm using:
import json
import os

import pymupdf4llm

# Convert the whole PDF to a single Markdown string.
md_content = pymupdf4llm.to_markdown(self.pdf_file_path)

# Save the Markdown in the current directory, named after the PDF.
pdf_file_name = os.path.splitext(os.path.basename(self.pdf_file_path))[0]
md_file_path = f'{pdf_file_name}.md'
with open(md_file_path, 'w', encoding='utf-8') as md_file:
    md_file.write(md_content)

# Send the Markdown to the model and request a JSON answer.
# `client` and `system_prompt` are defined elsewhere.
user_prompt = f"Here is the Markdown content containing order details:\n{md_content}"
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=messages,
    response_format={'type': 'json_object'},
    max_tokens=6666,
)
result = json.loads(response.choices[0].message.content)
@cobaltautomationdev please always provide reproducible data!
If content is recognized neither as table content nor as other text, then this is a bug and should be reported as such in a regular bug issue.
In your case, please use the issue tab of PyMuPDF4LLM, https://github.com/pymupdf/RAG/issues
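One way to back such a report with reproducible data is to compare, page by page, the words PyMuPDF's plain text extraction finds with the words that end up in the Markdown. This is only a rough sketch. The helper names `words`, `missing_words` and `report_missing` are invented for this example, and it assumes that `to_markdown(..., page_chunks=True)` returns one dictionary per page with the Markdown under the `"text"` key.

```python
import re
from collections import Counter


def words(text):
    """Split text into lowercase alphanumeric tokens, ignoring Markdown markup."""
    return re.findall(r"[^\W_]+", text.lower())


def missing_words(plain_text, markdown_text):
    """Return tokens that occur more often in plain_text than in markdown_text."""
    missing = Counter(words(plain_text)) - Counter(words(markdown_text))
    return sorted(missing.elements())


def report_missing(path):
    """Print, for each page, the words found in plain text but not in the Markdown."""
    import pymupdf  # imported lazily; older releases use `import fitz`
    import pymupdf4llm

    doc = pymupdf.open(path)
    chunks = pymupdf4llm.to_markdown(doc, page_chunks=True)
    for page, chunk in zip(doc, chunks):
        lost = missing_words(page.get_text(), chunk["text"])
        if lost:
            print(f"page {page.number + 1}: {len(lost)} words missing, e.g. {lost[:10]}")
```

A page that shows up in this report, together with the PDF itself, gives the maintainers something concrete to reproduce. Words split differently by the two extraction paths can cause false positives, so the output is a hint rather than proof.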