151 questions
0
votes
1
answer
158
views
How can I properly extract text from a PDF file to store it in an Elasticsearch index?
I'm working on a C#/Blazor project that extracts text from PDF files and stores it in an Elasticsearch index for full-text search.
The issue: when the PDF was originally a PowerPoint presentation, the ...
0
votes
0
answers
80
views
Issue with corrupt PDF when encrypting pages directly using PDFBox
Our application processes uploaded PDFs by extracting individual pages, grouping them per user, and then creating a new PDF that is encrypted with a password. However, when we apply the encryption ...
0
votes
0
answers
235
views
Handling complex tables with PyMuPDF
My use case contains textual table data, but the column header cells span multiple lines (image shared), which results in bad parsing by PyMuPDF. I have tried Camelot and Tabula as well, ...
0
votes
1
answer
58
views
Text extraction from PDF: ConnectError: [WinError 10061] No connection could be made because the target machine actively refused it
I'm trying to extract tables and text from PDFs and then ask questions about them with the help of LLMs. However, when I run the code, it shows error 10061. I think this is because I'm using ...
0
votes
1
answer
55
views
Overwrite a property in a used (but not imported) Class
I am using the fitz/PyMuPDF and pdf2docx Python packages to read tables from PDF files so that I can get data out of them and model it appropriately for storage in a data lake.
It seems like Converter ...
1
vote
0
answers
61
views
Image extraction from the PDF file
Using this Python code:
import fitz
# Open the PDF file
file = fitz.open("D41813 LS K2W COLLAR TOP 000751556 (0318).pdf")
# Iterate through each page of the PDF
for pageNumber, page in ...
0
votes
0
answers
174
views
Extracting screenshots from an Exam Paper for questions and their parts
I want to extract screenshots of questions from a PDF exam paper. I want the question and its parts to be separated, so the actual question's introduction would be in a different screenshot and the ...
1
vote
0
answers
74
views
Losing Data when using PDFQuery to convert PDF to XML
I am trying to convert a PDF file to an XML file using PDFQuery and then extract data from it using bbox coordinates. However, the converted XML file is often missing some of the data present in ...
0
votes
1
answer
162
views
Extracting text using iText7 throws exception
Extracting text from a PDF file using iText7 8.0.4:
MemoryStream pdfStream = ...
pdfStream.Position = 0;
var strategy = new LocationTextExtractionStrategy();
var reader = new PdfReader(pdfStream);
...
0
votes
2
answers
503
views
How to identify input fields in unstructured PDFs using Python
I am trying to parse an unstructured PDF file and extract information about some input fields, such as radio buttons, but I am not sure how to do that.
I tried using get_fields from PyPDF2, but it does not ...
1
vote
2
answers
645
views
CID encoding of font
I'm trying to extract text from a PDF with Python. None of the packages I tried could read it (PyPDF2, pdfminer, fitz, etc.), but some of them could return the CID encodings (e.g. (cid:3)).
Now I ...
3
votes
0
answers
854
views
I am using ocrmypdf to convert a scanned PDF to a searchable PDF and getting a dependency error that jbig2 and pngquant were "not found"
I am trying to convert a scanned PDF into a searchable PDF using ocrmypdf. In a few cases it throws the error:
The output file size is ...× larger than the input file.
Possible reasons for this include:
...
0
votes
1
answer
191
views
Problem extracting a specific table from a PDF-page with multiple tables. (Python)
This is my first time posting here on Stack Overflow because I really have nowhere else to turn.
My problem is extracting a specific table from a PDF file containing multiple tables, and converting ...
0
votes
0
answers
76
views
Extraction issue with bold heading letters from PDF using Tika
I am new to reading text from PDFs using Python. I am using Tika to extract content from a PDF, and when it extracts bold headings, it seems to fail.
example image
In the example above, it reads &...
0
votes
1
answer
643
views
'pdf device does not support type 3 fonts' when trying to process a PDF generated by Ghostscript using pdfminer and fitz
I am currently confronted with an issue related to the processing of PDF files generated through Ghostscript. Specifically, when attempting to extract text from these PDFs using pdfminer and fitz, I ...