Stack Overflow
0 votes
1 answer
158 views

I'm working on a C#/Blazor project that extracts text from PDF files and stores it in an Elasticsearch index for full-text search. The issue: when the PDF was originally a PowerPoint presentation, the ...
0 votes
0 answers
80 views

Our application processes uploaded PDFs by extracting individual pages, grouping them per user, and then creating a new PDF that is encrypted with a password. However, when we apply the encryption ...
0 votes
0 answers
235 views

My use case involves textual table data, but the column header cells contain multi-line values (image shared), which results in bad parsing by PyMuPDF. I have tried Camelot and Tabula as well, ...
0 votes
1 answer
58 views

I'm trying to extract tables and text from PDFs and then ask questions about them with the help of LLMs. However, when I run the code, it shows error 10061. I think this is because I'm using ...
0 votes
1 answer
55 views

I am using the Python packages fitz/PyMuPDF and pdf2docx to read tables from PDF files so that I can get data out of them and model it appropriately for storage in a data lake. It seems like Converter ...
1 vote
0 answers
61 views

using this py code import fitz # Open the PDF file file = fitz.open("D41813 LS K2W COLLAR TOP 000751556 (0318).pdf") # Iterate through each page of the PDF for pageNumber, page in ...
0 votes
0 answers
174 views

I want to extract screenshots of questions from a pdf exam paper. I wanted the question and its parts to be separated, so the actual question's introduction would be in a different screen shot and the ...
1 vote
0 answers
74 views

I am trying to convert a PDF file to an XML file using PDFQuery and then extracting data from it using bbox coordinates. However, the converted XML file is often missing some of the data present in ...
0 votes
1 answer
162 views

Extracting text from PDF file using iText7 8.0.4 MemoryStream pdfStream = ... pdfStream.Position = 0; var strategy = new LocationTextExtractionStrategy(); var reader = new PdfReader(pdfStream); ...
0 votes
2 answers
503 views

I am trying to parse an unstructured PDF file and extract information about some input fields like radiobuttons, but I am not sure how to do that. I tried using get_fields from PyPDF2, it does not ...
1 vote
2 answers
645 views

I'm trying to extract text from a PDF with Python. None of the packages I tried could read it (PyPDF2, pdfminer, fitz, etc.), but some of them could return the cid encodings (e.g. (cid:3)). Now I ...
3 votes
0 answers
854 views

I am trying to convert a scanned PDF into a searchable PDF using ocrmypdf. In a few cases it throws the error: The output file size is × larger than the input file. Possible reasons for this include: ...
0 votes
1 answer
191 views

This is my first time posting here on Stack Overflow because I really have nowhere else to turn. My problem is extracting a specific table from a PDF file containing multiple tables, and converting ...
0 votes
0 answers
76 views

I am new to reading text from PDFs using Python. I am using Tika to extract content from a PDF, and it seems to fail when extracting bold headings (example image). In the example above, it reads &...
0 votes
1 answer
643 views

I am currently confronted with an issue related to the processing of PDF files generated through Ghostscript. Specifically, when attempting to extract text from these PDFs using pdfminer and fitz, I ...

