134 questions
0
votes
0
answers
19
views
tabula-py read_pdf not running
I am trying to use tabula-py's read_pdf function to read data from a PDF.
import tabula
path = "./numbers.pdf"
print("reading")
data = tabula.read_pdf(path,pages=[2,3],...
0
votes
0
answers
235
views
Handling complex tables with PyMuPDF
My use case involves textual table data, but the column header cells contain multiple lines (image shared), which results in bad parsing by PyMuPDF. I have tried Camelot and Tabula as well, ...
0
votes
0
answers
71
views
Tabula GUI and Tabula-py give a different outcome
I'm trying to extract some data from a PDF table. I used the Tabula.exe app at first, and after selecting the wanted area, the resulting CSV is how I want it. I exported the template and I tried ...
0
votes
0
answers
86
views
Convert PDF file data to a DataFrame in Python?
Is there a way to convert the data in the PDF files below to a DataFrame?
https://www.onrr.gov/document/2018.pdf
https://www.onrr.gov/document/2021.pdf
I have used 'tabula-py' to convert the above PDFs to a DataFrame. ...
0
votes
1
answer
140
views
Tabula-Py getting confused with column names
I have a PDF that has some text at the top of the first page, and then the table starts. The table extends throughout the PDF (156 pages). I want to extract this table into CSV. I have successfully done ...
0
votes
0
answers
119
views
Is there any module other than tabula-py and camelot to extract tables from native pdfs?
I was using tabula-py to extract tabular information and store it in .csv files; however, it fails to understand the structure of the table. Screenshot of the PDF used as a dataset. Real structure of ...
1
vote
0
answers
253
views
Java not recognized in Python venv on Windows 11
I'm trying to use the tabula-py library in a Python virtual environment on Windows 11. Java is installed on my system, and java -version works outside the venv. However, inside the venv, I get 'java' ...
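One way to narrow this kind of problem down, assuming the cause is the PATH seen by the venv's interpreter (the excerpt is cut off, so this is only a guess), is to ask Python itself whether it can locate java. This is a minimal diagnostic sketch using the standard library; the helper name find_java is hypothetical and is not part of tabula-py.

```python
import os
import shutil


def find_java():
    """Return the path of the java executable visible to this interpreter, or None."""
    # shutil.which searches the PATH of the current process, which is the
    # same PATH a subprocess started from this interpreter would inherit.
    return shutil.which("java")


java_path = find_java()
if java_path is None:
    # Printing JAVA_HOME helps tell "Java is not installed" apart from
    # "Java is installed but its bin directory is not on PATH here".
    print("java not found on PATH")
    print("JAVA_HOME =", os.environ.get("JAVA_HOME"))
else:
    print("java found at", java_path)
```

Running this once inside the activated venv and once outside it shows whether the two environments really see a different PATH.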
1
vote
0
answers
59
views
Warnings when I use tabula-py
I get these warnings when I use tabula-py.
Apr 24, 2024 10:15:55 AM org.apache.pdfbox.pdmodel.PDDocument importPage
WARNING: inherited resources of source document are not imported to destination page
...
1
vote
0
answers
246
views
Fatal Java error when trying to use Tabula-py
All my current code:
import tabula
pdfpath= "Testpdfs/HSA certCut.pdf"
sbc = tabula.read_pdf(pdfpath, stream=True, pages=4, format="CSV")[0]
print(sbc)
I have a fresh install ...
2
votes
0
answers
80
views
Tabula: last line from each page not getting extracted using Python
I have a PDF with 4 pages containing 98 rows of tabular data. However, when I use tabula, the last line from each page is excluded from the final output. Below is the code:
import tabula
tabula....
0
votes
0
answers
102
views
Getting broken text while reading a PDF written in an Eastern language in Python
I have been facing a problem for a while. I am working on a project where I have to build a REST API that extracts text from a PDF and produces JSON data from it. The PDF format will be the same every time. And ...
0
votes
1
answer
21
views
Is it possible for tabula-py to extract numeric 007 as 007 instead of 7?
I use tabula-py to extract PDF table content, and numeric text such as 010019 or 0007 is always converted to float in the output. Is there any way to fix it so that it returns the correct value (0007 instead of 7....
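A possible approach, sketched here under assumptions: tabula-py parses the extracted table with pandas, so the type inference that drops leading zeros can be turned off by asking pandas to read every column as a string. The example below shows the effect with pandas alone on a made-up CSV snippet (the column names and values are illustrative, not taken from the question), since running tabula-py itself needs Java and a real PDF.

```python
import io

import pandas as pd

# Stand-in for the CSV text that the extraction step hands to pandas.
csv_text = "code,qty\n0007,3\n010019,12\n"

# Default type inference parses "0007" as the number 7.
inferred = pd.read_csv(io.StringIO(csv_text))

# Forcing every column to str keeps the digits exactly as written.
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(inferred["code"].tolist())  # [7, 10019]
print(as_text["code"].tolist())   # ['0007', '010019']

# With tabula-py the same option can be forwarded through pandas_options
# (not run here; "file.pdf" is a placeholder path):
# tables = tabula.read_pdf("file.pdf", pages="all", pandas_options={"dtype": str})
```

The trade-off is that genuinely numeric columns also come back as strings and have to be converted explicitly afterwards.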
1
vote
1
answer
313
views
Encoding Issue When Attempting to Convert Hindi Script PDF to CSV in Python
I'm currently attempting to convert a PDF file containing Hindi Devanagari script to a CSV file using the fitz library in Python, but when I read in the text I encounter a strange encoding issue.
Here ...
1
vote
0
answers
81
views
PDF scraping, tabula-py - columns do not correspond with "true" values of PDF file
I am stuck again with PDF scraping and observe that the columns do not correspond to some of the values that I obtain for them. Basically, I want to obtain a CSV file, but first I want to ...
0
votes
1
answer
60
views
Keep Leading Zeros in Converted CSV Using Tabula-Py and Pandas
Is there a way to maintain leading zeros in cells while still using the tabula-py convert_into function? Perhaps by passing something into the 'options' parameter to read them as strings? The ...