Unable to read table which contains sub/mini table inside parent table · pymupdf/PyMuPDF · Discussion #4632

SuryaV21
Jul 27, 2025

Hello guys,

I am trying to read tables in various formats from a PDF using PyMuPDF 1.26.3. However, for the table in the attached PDF, the output is not what I expect. I also have other table formats that find_tables() does not read properly when I convert the result to a DataFrame.

@JorjMcKie, please help me read this PDF.

Thank you in advance

Replies: 4 comments 16 replies

JorjMcKie
Jul 27, 2025
Maintainer

Stacked tables are not supported, and will not be in the foreseeable future.
As per your example: What is wrong with this output:

['Protocol Title', 'Abc', None]
['Brief Title', 'Def', None]
['Study Intervention', 'ghi', None]
['Background and\nRationale', 'jkl', None]
['Objectives and\nEndpoints', 'Objectives', 'Endpoints']
['Primary', 'mno', 'pqr']
['Secondary', '• stu\n• vwx', '• yz\n• 123']
['Overall Design', '456\n789: 101112\n131415\n161718:192021', None]

Or this one using pandas:

   Protocol Title             Abc                                      Col2
0  Brief Title                Def                                      None
1  Study Intervention         ghi                                      None
2  Background and\nRationale  jkl                                      None
3  Objectives and\nEndpoints  Objectives                               Endpoints
4  Primary                    mno                                      pqr
5  Secondary                  • stu\n• vwx                             • yz\n• 123
6  Overall Design             456\n789: 101112\n131415\n161718:192021  None
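
Since the nested "Objectives / Endpoints" block comes back as ordinary rows of the parent table, it can be separated afterwards in plain Python. The following is only a sketch of one possible heuristic, assuming (as in the output above) that merged two-column rows end in None while the nested rows fill all three columns. The function name split_stacked_rows is hypothetical and not part of PyMuPDF.

```python
def split_stacked_rows(rows):
    """Split extracted rows into parent rows and nested sub-table rows.

    Heuristic (an assumption, based on the sample output): a row whose last
    cell is None belongs to the two-column parent table; a row with all
    cells filled belongs to the nested sub-table.
    """
    parent, nested = [], []
    for row in rows:
        if row[-1] is None:
            parent.append(row[:-1])  # drop the empty trailing cell
        else:
            nested.append(row)
    return parent, nested


# Rows exactly as printed by tab.extract() in the reply above.
rows = [
    ["Protocol Title", "Abc", None],
    ["Brief Title", "Def", None],
    ["Study Intervention", "ghi", None],
    ["Background and\nRationale", "jkl", None],
    ["Objectives and\nEndpoints", "Objectives", "Endpoints"],
    ["Primary", "mno", "pqr"],
    ["Secondary", "• stu\n• vwx", "• yz\n• 123"],
    ["Overall Design", "456\n789: 101112\n131415\n161718:192021", None],
]

parent, nested = split_stacked_rows(rows)
for row in nested:
    print(row)
```

The heuristic would misfire on a parent row that happens to have three filled cells, so it should be checked against each table layout it is applied to.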

3 replies

@SuryaV21

SuryaV21 Jul 27, 2025
Author

@JorjMcKie Please share the code to read the table

@JorjMcKie

JorjMcKie Jul 27, 2025
Maintainer

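import pymupdf

doc = pymupdf.open("test.pdf")
page = doc[0]
tabs = page.find_tables()  # returns a TableFinder
tab = tabs[0]  # first table found on the page
for row in tab.extract():  # each row is a list of cell values
    print(row)
print(tab.to_pandas())  # the same table as a pandas DataFrame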

@SuryaV21

SuryaV21 Jul 28, 2025
Author

Thank you, @JorjMcKie. I want to pass the table information to an LLM for Q&A. Could you please also respond to my second scenario below: why is the table not identified, and how can I extract its data into a DataFrame? Please include code.
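
For handing extracted rows to an LLM, a Markdown table is a common text format. The sketch below converts a list of rows (as returned by tab.extract()) into Markdown using plain Python. The helper name rows_to_markdown is hypothetical, and treating the first row as the header is an assumption. According to the PyMuPDF documentation, Table objects also provide a to_markdown() method, which may be the simpler route.

```python
def rows_to_markdown(rows):
    """Render a list of rows as a Markdown table; the first row is the header."""

    def cell(value):
        # None (an empty or merged cell) becomes an empty string;
        # line breaks and pipes would otherwise break the table syntax.
        text = "" if value is None else str(value)
        return text.replace("\n", "<br>").replace("|", "\\|")

    header, *body = rows
    lines = ["| " + " | ".join(cell(v) for v in header) + " |"]
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    for row in body:
        lines.append("| " + " | ".join(cell(v) for v in row) + " |")
    return "\n".join(lines)


sample = [["Name", "Value"], ["Background and\nRationale", None]]
print(rows_to_markdown(sample))
```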

SuryaV21
Jul 27, 2025
Author

image

test.pdf

@JorjMcKie I have one more table, shown in the image above, that find_tables() does not identify. Could you please help me with the code and explain why find_tables() is not able to read this table?

The attached PDF contains a few more tables that find_tables() does not identify. Please help me understand why, as my PDF contains many customized table formats.

7 replies

@JorjMcKie

JorjMcKie Jul 28, 2025
Maintainer

Both pages contain one table each, and both can be extracted.
image

@SuryaV21

SuryaV21 Jul 28, 2025
Author

@JorjMcKie How can I find out how many tables are present on a page, so that I can iterate over them?
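
One way to count and iterate the detected tables is through the tables attribute of the object that find_tables() returns. This is a minimal sketch, not a definitive implementation; the helper names tables_on_page and count_tables are hypothetical.

```python
def tables_on_page(page):
    """Return the Table objects that find_tables() detected on one page."""
    finder = page.find_tables()  # a TableFinder, created even if nothing is found
    return list(finder.tables)   # its list of Table objects, possibly empty


def count_tables(pdf_path):
    """Map each 0-based page number to the number of detected tables."""
    import pymupdf  # imported here so tables_on_page() can be tried without a PDF

    counts = {}
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            counts[page.number] = len(tables_on_page(page))
    return counts
```

With this, a loop such as "for tab in tables_on_page(page): print(tab.to_pandas())" visits every detected table on a page.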

@SuryaV21

SuryaV21 Jul 28, 2025
Author

@JorjMcKie With the above code, the table is still not identified even though it is present on the PDF page; plain find_tables() does not detect it. Is there any other way to identify tables in a PDF?

@SuryaV21

SuryaV21 Jul 28, 2025
Author

@JorjMcKie I have two different types of table format. As shown in the attached images, the first one has only rows and no columns, so it is not identified as a table, whereas the second one has both rows and columns and is identified as a table.

I would like image 1 to be treated as a table as well. Even though it has no columns, it is structured in rows that are separated by header lines. Is there a way to identify such a row-only layout and treat it as a table?

From image 1, I would like to get the following information:

the header of the topic
the table name
the names of all categorized sections
where each section begins and ends
the category name, added to the relevant context across all sections

@JorjMcKie Please let me know if you need any additional information.

image1:
image

image2:
image

@JorjMcKie

JorjMcKie Jul 28, 2025
Maintainer

The minimum table dimensions are 2 x 2. So 1-column / 1 row tables are not possible.
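
Because a row-only layout falls below that 2 x 2 minimum, one alternative is to segment the page text manually along its horizontal separator lines. The sketch below shows only the grouping step, in plain Python. It assumes that the y-coordinates of the separator lines and the top y-coordinate of each text block are already known; they might be derived from page.get_drawings() and page.get_text("blocks"), but whether the separators in this particular PDF are stored as vector lines is not known. The function name group_blocks_by_rules is hypothetical.

```python
from bisect import bisect_right


def group_blocks_by_rules(rule_ys, blocks):
    """Group text blocks into sections delimited by horizontal rules.

    rule_ys: y-coordinates of the horizontal separator lines.
    blocks:  (y0, text) pairs, where y0 is the top of a text block.
    Returns the non-empty sections from top to bottom, each a list of texts.
    """
    rules = sorted(rule_ys)
    sections = [[] for _ in range(len(rules) + 1)]
    for y0, text in sorted(blocks):
        # bisect_right counts the rules at or above this block's top edge,
        # which is the index of the section the block belongs to.
        sections[bisect_right(rules, y0)].append(text)
    return [section for section in sections if section]


# Made-up coordinates for illustration only.
sections = group_blocks_by_rules(
    [100, 200],
    [(50, "Title"), (110, "Section A"), (130, "text a"), (210, "Section B")],
)
print(sections)
```

The first text of each section could then serve as the section name, which would also mark where one section ends and the next begins.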

SuryaV21
Jul 28, 2025
Author

I used the code below to determine whether a page contains text, tables, images, or any combination of them. However, even when a page consists entirely of an image, it reports a table as present. Please look at the code and let me know if I did anything wrong.

import fitz  # legacy import name of PyMuPDF
import pandas as pd


def analyze_page_content(pdf_path):
    # Open the PDF document
    doc = fitz.open(pdf_path)
    content_list = []  # Use a list to hold content analysis for each page

    for page_number in range(len(doc)):
        page = doc[page_number]
        content_analysis = {
            "page_number": page_number,
            "contains_text": False,
            "contains_table": False,
            "contains_images": False,
        }

        # Check for text
        text = page.get_text()
        if text.strip():  # If there's any non-whitespace text
            content_analysis["contains_text"] = True

        # Check for images
        image_list = page.get_images(full=True)
        if image_list:  # If any images are found
            content_analysis["contains_images"] = True

        # Check for tables (requires a PyMuPDF version that provides find_tables)
        try:
            tables = page.find_tables()
            if tables:  # If any tables found
                content_analysis["contains_table"] = True
        except Exception as e:
            print(f"Error checking for tables on page {page_number}: {e}")

        # Append the content analysis to the list
        content_list.append(content_analysis)
        # Print the analysis results for the current page
        print(content_analysis)

    df = pd.DataFrame(content_list)  # Create DataFrame from the list
    # Close the document
    doc.close()
    return df


df = analyze_page_content(pdf_path)
df.to_csv('output/page_metadata.csv')
print(df.head())

4 replies

@JorjMcKie

JorjMcKie Jul 28, 2025
Maintainer

The check "if tables:" is the wrong one! Check tables.tables instead. A TableFinder will always be created, but its internal list of Table objects may be empty.
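
A minimal sketch of the corrected check follows. It uses a stand-in class instead of a real TableFinder so that the pitfall can be reproduced without a PDF; StandInFinder and has_tables are hypothetical names, and the stand-in only mimics the behavior described above (an object that always exists and holds a possibly empty tables list).

```python
class StandInFinder:
    """Stand-in for the object returned by page.find_tables() (illustration only)."""

    def __init__(self, tables):
        self.tables = tables  # list of Table objects, possibly empty


def has_tables(finder):
    """True only if the finder actually holds at least one table."""
    return len(finder.tables) > 0


empty = StandInFinder([])
print(bool(empty))        # the object itself is truthy although it holds no tables
print(has_tables(empty))  # the check on its tables list is not fooled
```

In the page analysis code, this corresponds to replacing the "if tables:" condition with "if tables.tables:".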

@SuryaV21

SuryaV21 Jul 28, 2025
Author

@JorjMcKie can you please share the updated code

@JorjMcKie

JorjMcKie Jul 28, 2025
Maintainer

I have no updated code. Just noticed that error.

@SuryaV21

SuryaV21 Jul 28, 2025
Author

@JorjMcKie I tried that, and it still reports a table. However, when I then try to read the tables using find_tables(), it throws an error. Could you please help me understand why and how to fix the issue?

SuryaV21
Jul 28, 2025
Author

@JorjMcKie I would like to connect with you on LinkedIn. Could you please share your LinkedIn ID, if that is OK with you?

2 replies

@JorjMcKie

JorjMcKie Jul 28, 2025
Maintainer

This user id has no LinkedIn account.

@SuryaV21

SuryaV21 Jul 28, 2025
Author

@JorjMcKie Could you please share your LinkedIn ID so that I can send a connection request, if that is OK with you?
