Stack Overflow
0 votes
0 answers
19 views

I am trying to use tabula-py's read_pdf function to read data from a pdf. import tabula path = "./numbers.pdf" print("reading") data = tabula.read_pdf(path,pages=[2,3],...
Jack
  • 171
0 votes
0 answers
235 views

My use case involves textual table data, but the column header cells contain multiple lines (image shared), which results in bad parsing by PyMuPDF. I have tried Camelot and Tabula as well, ...
0 votes
0 answers
71 views

I'm trying to extract some data from a PDF table. I started with the Tabula.exe app, and after selecting the wanted area the resulting CSV is how I want it. I exported the template and I tried ...
0 votes
0 answers
86 views

Is there a way to convert the data in the PDF files below to a DataFrame? https://www.onrr.gov/document/2018.pdf https://www.onrr.gov/document/2021.pdf I have used 'tabula-py' to convert the above PDFs to a DataFrame. ...
0 votes
1 answer
140 views

I have a PDF that has some text at the top of the first page, and then a table starts. The table extends throughout the PDF (156 pages). I want to extract this table into a CSV. I have successfully done ...
0 votes
0 answers
119 views

I was using tabula-py to extract tabular information and store it in .csv files; however, it fails to understand the structure of the table. Screenshot of the PDF used as a dataset. Real structure of ...
1 vote
0 answers
253 views

I'm trying to use the tabula-py library in a Python virtual environment on Windows 11. Java is installed on my system, and java -version works outside the venv. However, inside the venv, I get 'java' ...
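One possible first check for this kind of problem, as a minimal sketch only: ask the Python interpreter inside the venv whether it can see a `java` executable on its `PATH` at all. The helper names `find_java` and `path_entries` are made up for illustration; `shutil.which` and `os.environ` are standard library. This only diagnoses visibility and does not claim to be the cause or the fix for the question above.

```python
import os
import shutil


def find_java():
    """Return the full path of the 'java' executable visible to this
    Python process, or None if it is not on PATH."""
    return shutil.which("java")


def path_entries():
    """Return the non-empty PATH directories this process sees, in search order."""
    return [p for p in os.environ.get("PATH", "").split(os.pathsep) if p]


if __name__ == "__main__":
    java = find_java()
    if java is None:
        # Nothing named 'java' was found on the PATH of this interpreter.
        print("java is not on PATH for this interpreter")
        print("JAVA_HOME =", os.environ.get("JAVA_HOME"))
        for entry in path_entries():
            print("PATH entry:", entry)
    else:
        print("java found at", java)
```

Running the script once with the venv activated and once without it shows whether the two environments see a different `PATH`.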
1 vote
0 answers
59 views

I get these warnings when I use tabula-py. Apr 24, 2024 10:15:55 AM org.apache.pdfbox.pdmodel.PDDocument importPage WARNING: inherited resources of source document are not imported to destination page ...
1 vote
0 answers
246 views

All my current code: import tabula pdfpath= "Testpdfs/HSA certCut.pdf" sbc = tabula.read_pdf(pdfpath, stream=True, pages=4, format="CSV")[0] print(sbc) I have a fresh install ...
2 votes
0 answers
80 views

I have a PDF with 4 pages containing 98 rows of tabular data. However, when I use tabula, the last line of each page is excluded from the final output. Below is the code: import tabula tabula....
0 votes
0 answers
102 views

I have been facing a problem for a while. I am working on a project where I have to build a REST API that extracts text from a PDF and produces JSON data from it. The PDF format will always be the same. And ...
0 votes
1 answer
21 views

I use tabula-py to extract PDF table content, but numeric text such as 010019 or 0007 is always converted to float. Is there any way to make it return the correct value (0007 instead of 7....
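A hedged sketch of two ways this is sometimes handled. The first relies on the `pandas_options` argument of `tabula.read_pdf`, which tabula-py documents as being passed on to pandas; asking for `dtype=str` there should keep cells as text. The second is a fallback for values that have already become floats, and it assumes the original field width is known. Both function names are hypothetical, and `read_tables_as_text` is not exercised here because it needs Java and a real PDF.

```python
def read_tables_as_text(pdf_path, pages="all"):
    """Read tables with every cell kept as a string (assumes tabula-py and Java are installed)."""
    import tabula  # imported lazily so the helper below works without tabula-py

    # pandas_options is handed to pandas; dtype=str avoids numeric conversion.
    return tabula.read_pdf(pdf_path, pages=pages, pandas_options={"dtype": str})


def restore_leading_zeros(value, width):
    """Fallback: turn an already-converted number such as 7.0 back into '0007'.

    Only valid when the original width of the field is known and fixed.
    """
    return str(int(float(value))).zfill(width)


if __name__ == "__main__":
    print(restore_leading_zeros(7.0, 4))
    print(restore_leading_zeros(10019.0, 6))
```

Reading as text at the source is usually the safer choice, since the fallback cannot recover the width once it is lost.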
1 vote
1 answer
313 views

I'm currently attempting to convert a PDF file containing Hindi Devanagari script to a CSV file using the fitz library in Python, but when I read in the text I encounter a strange encoding issue. Here ...
1 vote
0 answers
81 views

I am stuck again with PDF scraping: some of the values I obtain do not correspond to their columns. Basically, I want to obtain a CSV file, but first I want to ...
0 votes
1 answer
60 views

Is there a way to maintain leading zeros in cells while still using the tabula-py convert_into function? Perhaps by passing something into the 'options' parameter to read them as strings? The ...
Nick08
  • 167

