Unable to read table which contains sub/mini table inside parent table #4632

Unanswered
SuryaV21 asked this question in Looking for help
test.pdf

Hello guys,

I am trying to read tables in different formats from a PDF using PyMuPDF 1.26.3. However, the table in the attached PDF is not read as expected. I have other table formats as well that are not read properly by find_tables() when I convert them to a DataFrame.

@JorjMcKie Please help me on how to read the pdf

Thank you in advance

Replies: 4 comments 16 replies

Comment options

Stacked tables are not supported, and will not be in the foreseeable future.
As per your example: What is wrong with this output:

['Protocol Title', 'Abc', None]
['Brief Title', 'Def', None]
['Study Intervention', 'ghi', None]
['Background and\nRationale', 'jkl', None]
['Objectives and\nEndpoints', 'Objectives', 'Endpoints']
['Primary', 'mno', 'pqr']
['Secondary', '• stu\n• vwx', '• yz\n• 123']
['Overall Design', '456\n789: 101112\n131415\n161718:192021', None]

Or this one using pandas:

   Protocol Title             Abc                                      Col2
0  Brief Title                Def                                      None
1  Study Intervention         ghi                                      None
2  Background and\nRationale  jkl                                      None
3  Objectives and\nEndpoints  Objectives                               Endpoints
4  Primary                    mno                                      pqr
5  Secondary                  • stu\n• vwx                             • yz\n• 123
6  Overall Design             456\n789: 101112\n131415\n161718:192021  None
Comment options

@JorjMcKie Please share the code to read the table

Comment options

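import pymupdf

doc = pymupdf.open("test.pdf")
page = doc[0]  # first page of the document
tabs = page.find_tables()  # returns a TableFinder object
tab = tabs[0]  # first table found on the page
for row in tab.extract():  # each row is a list of cell values (str or None)
    print(row)
print(tab.to_pandas())  # the same table as a pandas DataFrame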
Comment options

Thank you @JorjMcKie. I want to pass the table information to an LLM for Q&A. Can you please also respond to my second scenario below: why is the table not identified, and how can I extract the table data into a DataFrame? Please include code.
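
For the LLM use case mentioned above, one option is to flatten the extracted rows into a Markdown table before building the prompt. The following is only a minimal sketch: rows_to_markdown is a hypothetical helper, not part of PyMuPDF, and it assumes the row format printed earlier in this thread (lists of strings, with None in positions that have no cell of their own). Recent PyMuPDF versions also provide Table.to_markdown(), which may already be sufficient.

```python
def rows_to_markdown(rows):
    """Render rows from Table.extract() as a Markdown table string.

    Hypothetical helper for illustration; the first row is treated as the header.
    """
    def clean(cell):
        # None marks a position without its own cell; render it as empty.
        if cell is None:
            return ""
        # Newlines and pipes would break a Markdown table row.
        return str(cell).replace("\n", " ").replace("|", "\\|").strip()

    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows so that every row has the same number of columns.
    padded = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    lines = ["| " + " | ".join(padded[0]) + " |",
             "| " + " | ".join(["---"] * width) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in padded[1:]]
    return "\n".join(lines)


# Sample rows copied from the output shown earlier in this thread.
rows = [
    ["Protocol Title", "Abc", None],
    ["Objectives and\nEndpoints", "Objectives", "Endpoints"],
    ["Primary", "mno", "pqr"],
]
markdown = rows_to_markdown(rows)
print(markdown)
```

With a real document, the rows argument would come from tab.extract() as in the code above. Treating the first row as the header is an assumption that does not fit every table, so the result should be checked before it is sent to a model.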

Comment options

image

test.pdf

@JorjMcKie I have one more table, shown in the image above, but it is not identified by find_tables(). Can you please help me with the code and explain why find_tables() is not able to read this table?

The attached PDF contains a few more tables that are not identified by find_tables(). Please help me understand why, as my PDF contains many customized table formats.

Comment options

Both pages contain 1 table, which can be extracted:
image

Comment options

@JorjMcKie How can I find out how many tables are present on a page, so that I can iterate over them?
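
One way to answer the counting question is to look at the tables attribute of the object returned by page.find_tables(), which a later reply in this thread describes as the finder's internal list of Table objects. The sketch below is not an official recipe: count_tables, extract_page_tables, and extract_document_tables are hypothetical helper names, and the SimpleNamespace objects are stand-ins so the counting logic can run without a PDF.

```python
from types import SimpleNamespace


def count_tables(finder):
    """Number of tables detected on one page."""
    return len(finder.tables)


def extract_page_tables(finder):
    """Extracted rows of every table on one page, in detection order."""
    return [table.extract() for table in finder.tables]


def extract_document_tables(pdf_path):
    """Map page number -> list of extracted tables, skipping pages without tables."""
    import pymupdf  # imported here so the helpers above work without PyMuPDF

    results = {}
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            finder = page.find_tables()
            if count_tables(finder) > 0:
                results[page.number] = extract_page_tables(finder)
    return results


# Stand-in finder with two fake tables, for illustration only.
fake_table = SimpleNamespace(extract=lambda: [["a", "b"], ["c", "d"]])
fake_finder = SimpleNamespace(tables=[fake_table, fake_table])
print(count_tables(fake_finder))
print(extract_page_tables(fake_finder)[0])
```

Calling extract_document_tables("test.pdf") would return a dictionary keyed by zero-based page number, containing only the pages on which at least one table was detected.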

Comment options

@JorjMcKie After using the above code, the table is still not identified even though it is present on the PDF page; the plain find_tables() function does not detect it. Is there any other way to identify the tables in a PDF?

Comment options

@JorjMcKie I have 2 different table formats. As shown in the attached images, the table in image 1 has only rows and no columns, so it is not identified as a table, whereas the table in image 2 has both rows and columns and is identified as a table.

Now I want image 1 to be treated as a table as well, even though it has no columns, because it is structured in rows that are separated by header lines. Is there a way to identify this row-only format and treat it as a table?

From image 1 I would like to get the following information:

  • the header of the topic
  • the table name
  • the names of all categorized sections
  • where each section begins and ends
  • the category name, added to the relevant context across all the sections

@JorjMcKie please let me know if you need any additional information.

image1:
image

image2:
image

Comment options

The minimum table dimensions are 2 x 2. So tables with only 1 column or only 1 row are not possible.
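
Since a row-only layout cannot be detected as a table, one possible workaround is to split the page text into sections using the horizontal rules that separate them. This is a rough sketch under several assumptions, not a PyMuPDF feature: group_by_rules and page_bands are hypothetical helpers, the rules are assumed to be vector drawings that are wide and at most 2 points high, and the min_width threshold is arbitrary. The sketch relies on page.get_drawings() and page.get_text("blocks") only for the drawing rectangles and the text block tuples.

```python
from bisect import bisect_right


def group_by_rules(rule_ys, blocks):
    """Assign text blocks to the bands between horizontal rules.

    rule_ys: y-coordinates of the horizontal rules
    blocks:  (y_top, text) pairs
    Returns one list of texts per band; band 0 lies above the first rule.
    """
    rules = sorted(rule_ys)
    bands = [[] for _ in range(len(rules) + 1)]
    for y_top, text in sorted(blocks):
        # bisect_right counts how many rules lie at or above this block.
        bands[bisect_right(rules, y_top)].append(text)
    return bands


def page_bands(page, min_width=50):
    """Collect rules and text blocks from a PyMuPDF page (assumed layout)."""
    rule_ys = [
        path["rect"].y0
        for path in page.get_drawings()
        # Assumption: a separator is a wide, nearly flat drawing.
        if path["rect"].width >= min_width and path["rect"].height <= 2
    ]
    blocks = [
        (block[1], block[4].strip())
        for block in page.get_text("blocks")
        if block[6] == 0  # block type 0 is text; images are skipped
    ]
    return group_by_rules(rule_ys, blocks)


# Synthetic coordinates, for illustration only.
demo = group_by_rules(
    [100, 200],
    [(50, "Title"), (120, "Section A"), (150, "body a"), (220, "Section B")],
)
print(demo)
```

In each band the first text block would typically be the section name and the remaining blocks its content, but that depends entirely on how the page is drawn, so the thresholds need tuning against the actual PDF.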

Comment options

I used the code below to determine whether a page contains text, tables, images, or any combination of them. However, even when a page consists entirely of an image, it reports that a table is present. Please look at the code below and let me know if I did anything wrong.

import fitz  # legacy import name for PyMuPDF
import pandas as pd


def analyze_page_content(pdf_path):
    # Open the PDF document
    doc = fitz.open(pdf_path)
    content_list = []  # Use a list to hold content analysis for each page

    for page_number in range(len(doc)):
        page = doc[page_number]
        content_analysis = {
            "page_number": page_number,
            "contains_text": False,
            "contains_table": False,
            "contains_images": False,
        }

        # Check for text
        text = page.get_text()
        if text.strip():  # If there's any non-whitespace text
            content_analysis["contains_text"] = True

        # Check for images
        image_list = page.get_images(full=True)
        if image_list:  # If any images are found
            content_analysis["contains_images"] = True

        # Check for tables (make sure your version supports find_tables)
        try:
            tables = page.find_tables()
            if tables:  # If any tables found
                content_analysis["contains_table"] = True
        except Exception as e:
            print(f"Error checking for tables on page {page_number}: {e}")

        # Append the content analysis to the list
        content_list.append(content_analysis)
        # Print the analysis results for the current page
        print(content_analysis)

    df = pd.DataFrame(content_list)  # Create DataFrame from the list
    # Close the document
    doc.close()
    return df


df = analyze_page_content(pdf_path)
df.to_csv('output/page_metadata.csv')
print(df.head())

Comment options

This

[image]

is the wrong check! Check for tables.tables. A TableFinder will always be created. But its internal list of Table objects may be empty.
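
The point above can be sketched as follows. This is a minimal illustration, not code from the maintainer: page_has_tables and first_table_or_none are hypothetical helper names, and the SimpleNamespace objects stand in for the object returned by page.find_tables() so the check can be run without a PDF.

```python
from types import SimpleNamespace


def page_has_tables(finder):
    """True only if the finder's internal list of tables is non-empty."""
    return len(finder.tables) > 0


def first_table_or_none(finder):
    """Return the first table, or None when no table was detected."""
    return finder.tables[0] if finder.tables else None


# Stand-ins for the result of page.find_tables(); illustration only.
empty_finder = SimpleNamespace(tables=[])
full_finder = SimpleNamespace(tables=["table-0", "table-1"])

print(page_has_tables(empty_finder))      # False
print(page_has_tables(full_finder))       # True
print(first_table_or_none(empty_finder))  # None
```

In the page analysis function from the question, the condition would become a test on tables.tables rather than on tables itself. Guarding access in the same way should also avoid errors from indexing the first table on a page where none was found.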

Comment options

@JorjMcKie can you please share the updated code

Comment options

I have no updated code. Just noticed that error.

Comment options

@JorjMcKie I tried it, and it still reports a table; however, when I try to read the tables using find_tables(), it throws an error. Can you please help me understand why and how to fix the issue?

Comment options

@JorjMcKie I would like to connect with you on LinkedIn. Can you please share your LinkedIn ID, if that is OK with you?

Comment options

This user id has no LinkedIn account.

Comment options

@JorjMcKie Can you please share your LinkedIn ID so that I can send a connection request, if that is OK with you?
