Newest 'pdf-extraction' Questions


151 questions

0 votes

1 answer

158 views

How can I properly extract text from a PDF file to store it in an Elasticsearch index?

I'm working on a C#/Blazor project that extracts text from PDF files and stores it in an Elasticsearch index for full-text search. The issue: when the PDF was originally a PowerPoint presentation, the ...

fabsg0

asked Jun 25, 2025 at 6:56

0 votes

0 answers

80 views

Issue with corrupt PDF when encrypting pages directly using PDFBox

Our application processes uploaded PDFs by extracting individual pages, grouping them per user, and then creating a new PDF that is encrypted with a password. However, when we apply the encryption ...

Lvkas_

asked Apr 7, 2025 at 14:59

0 votes

0 answers

235 views

Handling complex tables with PyMuPDF

My use case contains textual table data, but the column header cell values span multiple lines (image shared), which results in bad parsing by PyMuPDF. I have tried Camelot and Tabula as well, ...

Arbaaz Ali

asked Mar 24, 2025 at 16:54

0 votes

1 answer

58 views

Text extraction from PDF, ConnectError: [WinError 10061] No connection could be made because the target machine actively refused it

I'm trying to extract tables and text from PDFs and then ask questions about them with the help of LLMs. However, when I run the code, it shows a 10061 error. I think this is because I'm using ...

Professor Chimp

asked Nov 23, 2024 at 10:47

0 votes

1 answer

55 views

Overwrite a property in a used (but not imported) Class

I am using fitz/pymupdf and pdf2docx packages of python to read tables from pdf files so that I can get data out of them and model it appropriately for storage in a data lake. It seems like Converter ...

swygerts

asked Nov 13, 2024 at 17:18

1 vote

0 answers

61 views

Image extraction from the PDF file

using this py code import fitz # Open the PDF file file = fitz.open("D41813 LS K2W COLLAR TOP 000751556 (0318).pdf") # Iterate through each page of the PDF for pageNumber, page in ...

dheeraj gakkampudi

asked Jul 25, 2024 at 13:37

0 votes

0 answers

174 views

Extracting screenshots from an Exam Paper for questions and their parts

I want to extract screenshots of questions from a pdf exam paper. I wanted the question and its parts to be separated, so the actual question's introduction would be in a different screen shot and the ...

Mario_Dev

asked Jun 18, 2024 at 6:48

1 vote

0 answers

74 views

Losing Data when using PDFQuery to convert PDF to XML

I am trying to convert a PDF file to an XML file using PDFQuery and then extracting data from it using bbox coordinates. However, the converted XML file is often missing some of the data present in ...

Priyanshu Lahiri

asked May 23, 2024 at 4:28

0 votes

1 answer

162 views

Extracting text using iText7 throws exception

Extracting text from PDF file using iText7 8.0.4 MemoryStream pdfStream = ... pdfStream.Position = 0; var strategy = new LocationTextExtractionStrategy(); var reader = new PdfReader(pdfStream); ...

Andrus

28.3k

asked May 22, 2024 at 7:36

0 votes

2 answers

503 views

How to identify input fields in unstructured PDFs using Python

I am trying to parse an unstructured PDF file and extract information about some input fields like radiobuttons, but I am not sure how to do that. I tried using get_fields from PyPDF2, it does not ...

Priyanshu Lahiri

asked May 22, 2024 at 7:01

1 vote

2 answers

645 views

CID encoding of font

I'm trying to extract text from a PDF with Python. None of the packages I tried could read it (PyPDF2, pdfminer, fitz, etc.), but some of them could return the CID encodings, e.g. (cid:3). Now I ...

Franciska

asked Jan 28, 2024 at 12:11

3 votes

0 answers

854 views

I am using ocrmypdf for converting the scanned pdf to searchable pdf. I am getting the dependency error of jbig2 and pngquant - "was not found"

I am trying to convert a scanned PDF into a searchable PDF using ocrmypdf. In a few cases it throws the error: The output file size is larger than the input file. Possible reasons for this include: ...

Sisir Das

asked Jan 3, 2024 at 7:43

0 votes

1 answer

191 views

Problem extracting a specific table from a PDF-page with multiple tables. (Python)

This is my first time posting here on Stack Overflow because I really have nowhere else to turn. My problem is extracting a specific table from a PDF file containing multiple tables, and converting ...

Zain Fendukly

asked Oct 3, 2023 at 14:38

0 votes

0 answers

76 views

Extraction issue with bold heading letters from pdf using tika

I am new to reading text from PDFs using Python. I am using tika to extract content from a PDF, and when it extracts bold headings, it seems to fail. In the example image, it reads ...

Glinty

asked Sep 29, 2023 at 0:13

0 votes

1 answer

643 views

'pdf device does not support type 3 fonts' when trying to process a PDF generated by Ghostscript using pdfminer and fitz

I am currently confronted with an issue related to the processing of PDF files generated through Ghostscript. Specifically, when attempting to extract text from these PDFs using pdfminer and fitz, I ...

Abhishek Yadav

asked Sep 14, 2023 at 8:06
