Unable to read table which contains sub/mini table inside parent table #4632
-
Hello guys,
I am trying to read tables in different formats from a PDF using PyMuPDF version 1.26.3. However, the table in the attached PDF is not read as expected. I also have other table formats that are not read properly by find_tables() and then converted to a DataFrame.
@JorjMcKie Please help me with how to read the PDF.
Thank you in advance.
Replies: 4 comments 16 replies
-
Stacked tables are not supported, and will not be in the foreseeable future.
As per your example: What is wrong with this output:
['Protocol Title', 'Abc', None]
['Brief Title', 'Def', None]
['Study Intervention', 'ghi', None]
['Background and\nRationale', 'jkl', None]
['Objectives and\nEndpoints', 'Objectives', 'Endpoints']
['Primary', 'mno', 'pqr']
['Secondary', '• stu\n• vwx', '• yz\n• 123']
['Overall Design', '456\n789: 101112\n131415\n161718:192021', None]
Or this one using pandas:
   Protocol Title             Abc                                      Col2
0  Brief Title                Def                                      None
1  Study Intervention         ghi                                      None
2  Background and\nRationale  jkl                                      None
3  Objectives and\nEndpoints  Objectives                               Endpoints
4  Primary                    mno                                      pqr
5  Secondary                  • stu\n• vwx                             • yz\n• 123
6  Overall Design             456\n789: 101112\n131415\n161718:192021  None
-
@JorjMcKie Please share the code to read the table
-
import pymupdf

doc = pymupdf.open("test.pdf")
page = doc[0]
tabs = page.find_tables()
tab = tabs[0]
for row in tab.extract():
    print(row)
print(tab.to_pandas())
-
Thank you @JorjMcKie. I want to pass the table information to an LLM for Q&A. Could you please also respond to my second scenario below: why is the table not identified, and how can I extract the table data into a DataFrame? Please include the code.
-
@JorjMcKie I have one more table, shown in the image above, but it is not identified by find_tables(). Can you please help me with the code and explain why find_tables() is not able to read this table?
The attached PDF contains a few more tables that find_tables() does not identify. Please help me understand, as my PDF contains many customized table formats.
-
Both pages contain one table each, which can be extracted.
image
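To check how many tables were detected on each page, one option is to loop over the `tables` list of the TableFinder that `find_tables()` returns. The following is a minimal sketch, not code from this thread: the helper name `summarize_tables` and the file name in the usage note are hypothetical, and it assumes the `row_count` and `col_count` attributes of PyMuPDF's Table objects.

```python
def summarize_tables(doc):
    """Return one record per detected table: page number, table index, rows, columns."""
    records = []
    for page in doc:
        finder = page.find_tables()  # a TableFinder, even when no table is found
        for index, table in enumerate(finder.tables):
            records.append({
                "page": page.number,
                "table": index,
                "rows": table.row_count,
                "cols": table.col_count,
            })
    return records


def print_table_summary(pdf_path):
    """Open a PDF (hypothetical path) and print one line per detected table."""
    import pymupdf  # imported here so summarize_tables() can be used without it

    doc = pymupdf.open(pdf_path)
    try:
        for record in summarize_tables(doc):
            print(record)
    finally:
        doc.close()
```

Calling `print_table_summary("test.pdf")` would then list every table per page; a page with no detected tables simply contributes no records.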
-
@JorjMcKie How can I find out how many tables are present, so that I can iterate over them?
-
@JorjMcKie After using the above code, the table is still not identified even though it is present on the PDF page; the plain find_tables() function does not detect it. Is there any other way to identify the tables in a PDF?
-
@JorjMcKie I have two different table formats. As shown in the attached images, the first image has only rows and no columns, so it is not identified as a table, whereas image 2 has both rows and columns and is identified as a table.
Now I want image 1 to be treated as a table as well: even though it has no columns, it is structured in rows separated by header lines. Is there a way to detect this rows-only format and treat it as a table?
From image 1 I would like to get the following information:
- header of the topic
- table name
- all section names that are categorized
- where each section begins and ends
- the category name added to the relevant context across all sections
@JorjMcKie please let me know if you need any additional information.
image1:
image
image2:
image
-
The minimum table dimensions are 2 x 2. So 1-column / 1 row tables are not possible.
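Because a rows-only layout falls below that 2 x 2 minimum, one possible workaround outside find_tables() is to split the page text into bands between the horizontal separator lines. The sketch below is only an illustration under assumptions, not something confirmed in this thread: `split_into_sections` is a hypothetical helper working on plain coordinates, and it assumes the y-positions of the rules and of the text lines have already been collected (for example from `page.get_drawings()` and `page.get_text("blocks")`).

```python
import bisect


def split_into_sections(rule_ys, lines):
    """Group text lines into bands bounded by horizontal rules.

    rule_ys: y-coordinates of the horizontal separator lines.
    lines:   (y, text) pairs, where y is the top of each text line.
    Returns len(rule_ys) + 1 lists of text; band 0 holds everything
    above the first rule, band i the text between rule i-1 and rule i.
    """
    rules = sorted(rule_ys)
    sections = [[] for _ in range(len(rules) + 1)]
    for y, text in sorted(lines):
        # bisect_right counts how many rules lie at or above this line
        sections[bisect.bisect_right(rules, y)].append(text)
    return sections


# Made-up coordinates: two rules, four text lines
bands = split_into_sections(
    [100, 200],
    [(50, "Title"), (120, "Section A"), (150, "body a"), (220, "Section B")],
)
print(bands)
```

Each band then starts at one rule and ends at the next, which gives a section's beginning and end; the first line of a band could serve as its section name, if the layout follows that pattern.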
-
I used the code below to identify whether a page contains text, tables, images, or any combination of them. However, even when a page consists entirely of an image, it reports a table as present. Please look at the code below and let me know if I did anything wrong.
import fitz  # PyMuPDF
import pandas as pd

def analyze_page_content(pdf_path):
    # Open the PDF document
    doc = fitz.open(pdf_path)
    content_list = []  # Use a list to hold content analysis for each page
    for page_number in range(len(doc)):
        page = doc[page_number]
        content_analysis = {
            "page_number": page_number,
            "contains_text": False,
            "contains_table": False,
            "contains_images": False
        }
        # Check for text
        text = page.get_text()
        if text.strip():  # If there's any non-whitespace text
            content_analysis["contains_text"] = True
        # Check for images
        image_list = page.get_images(full=True)
        if image_list:  # If any images are found
            content_analysis["contains_images"] = True
        # Check for tables (using find_tables if applicable; may require a specific setup)
        # Note: Make sure to check if your version supports this
        try:
            # Check for any tables (this is pseudo-code; this function may not be available)
            tables = page.find_tables()  # Uncomment if find_tables is supported in your version
            if tables:  # If any tables found
                content_analysis["contains_table"] = True
        except Exception as e:
            print(f"Error checking for tables on page {page_number}: {e}")
        # Append the content analysis to the list
        content_list.append(content_analysis)
        # Print the analysis results for the current page
        print(content_analysis)
    df = pd.DataFrame(content_list)  # Create DataFrame from the list
    # Close the document
    doc.close()
    return df

df = analyze_page_content(pdf_path)
df.to_csv('output/page_metadata.csv')
print(df.head())
-
This [image] is the wrong check! Check for tables.tables. A TableFinder will always be created. But its internal list of Table objects may be empty.
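A minimal sketch of the check described here, assuming the page object comes from PyMuPDF; the helper name `page_has_tables` is hypothetical. It tests the TableFinder's `tables` list instead of the TableFinder itself.

```python
def page_has_tables(page):
    """Return True only if find_tables() detected at least one table."""
    finder = page.find_tables()    # a TableFinder object is always returned
    return len(finder.tables) > 0  # the list of Table objects may be empty


# Usage (assumes PyMuPDF is installed and a file "test.pdf" exists):
#   import pymupdf
#   doc = pymupdf.open("test.pdf")
#   print(page_has_tables(doc[0]))
```

In the page analysis function above, this would take the place of the `if tables:` test.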
-
@JorjMcKie can you please share the updated code
-
I have no updated code. Just noticed that error.
-
@JorjMcKie I tried it, but it still reports a table. However, when I try to read the tables using find_tables(), it throws an error. Can you please help me understand why and how to fix the issue?
-
@JorjMcKie I am trying to connect with you on LinkedIn. Could you please share your LinkedIn ID, if that's OK with you?
-
This user id has no LinkedIn account.
-
@JorjMcKie Can you please share your LinkedIn ID so that I can send a connection request, if that's OK with you?