I am a teacher and I want to use Python to create a worksheet for my students. I have a vocabulary PDF with content like this:
do your best duː jɔː best
33, 81
do your hair/make-up duː jɔː heə/ ˈmeɪkʌp
81
do/work overtime duː/wɜːk ˈəʊvətaɪm
36
do/write an essay duː/raɪt æn ˈeseɪ
33
document ˈdɒkjəmənt
54
documentary ˌdɒkjəˈmentəri
52
dollar ˈdɒlə
19
dolphin ˈdɒlfɪn
8
don’t worry dəʊnt ˈwʌri
65
I want to convert it into a table with three columns: vocab, API, and number that I can paste directly into Google Sheets, like this:
vocab | API | number
do your best | duː jɔː best | 33, 81
do your hair/make-up | duː jɔː heə/ˈmeɪkʌp | 81
do/work overtime | duː/wɜːk ˈəʊvətaɪm | 36
... | ... | ...
I tried using the following Python code to extract text from the PDF and save it to a CSV:
import os
import csv

from pdfminer.high_level import extract_text

base_path = r"C:\Users\PC\OneDrive\Desktop\New folder"
pdf_file = os.path.join(base_path, "vocab.pdf")
csv_file = os.path.join(base_path, "vocab.csv")

text = extract_text(pdf_file)
lines = text.splitlines()

with open(csv_file, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for line in lines:
        if line.strip():  # skip blank lines
            writer.writerow([line.strip()])

print(f"Done! CSV file created at: {csv_file}")
However, this only produces a blank CSV file.
What I was expecting: a CSV with three separate columns (vocab, API, and number), with each entry properly aligned, so that I can paste it directly into Google Sheets.
1 Answer
You need to account for two things that are peculiar to your data:
- The vocabulary / pronunciation parts are on separate lines from the numbers
- You need to separate the vocabulary from the pronunciation within each line
import csv
from pathlib import Path

from pdfminer.high_level import extract_text

# pylint: disable=invalid-name

BASE = Path("~").expanduser()  # use HOME directory
pdf_in = BASE / "SO.pdf"
csv_out = BASE / "SO.csv"

with csv_out.open("w", encoding="utf-8", newline="") as _pdf:
    writer = csv.writer(_pdf)
    flag = False
    v, p = "", ""  # vocabulary and pronunciation parts
    for line in map(str.strip, extract_text(pdf_in).splitlines()):
        if line:
            if flag:
                # replace multiple contiguous spaces with one space
                line = " ".join(line.split())
                writer.writerow([v, p, line])
            else:
                # assumes an even number of tokens
                m = len(tokens := line.split()) // 2
                v = " ".join(tokens[:m])
                p = " ".join(tokens[m:])
            flag = not flag
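The code above splits each entry line at its midpoint and relies on entry and number lines strictly alternating. As a rough alternative sketch (not part of the original answer), the split point could instead be taken as the first token containing a non-ASCII character, on the assumption that the transcription always starts with at least one phonetic symbol, and number lines could be recognised by their content. The names `split_entry`, `rows_from_text`, and `NUMBER_LINE` are made up for this illustration, and the sample text is copied from the question rather than extracted from a real PDF.

```python
import re

# Assumption: a page-reference line holds only digits, commas and spaces.
NUMBER_LINE = re.compile(r"\d+(?:\s*,\s*\d+)*")


def split_entry(line: str) -> tuple[str, str]:
    """Split 'headword transcription' at the first token with a non-ASCII character."""
    tokens = line.split()
    for i, token in enumerate(tokens):
        # Treat the typographic apostrophe in words like "don’t" as plain ASCII.
        if i > 0 and not token.replace("’", "'").isascii():
            return " ".join(tokens[:i]), " ".join(tokens[i:])
    # No phonetic symbol found: fall back to the midpoint split used above.
    m = len(tokens) // 2
    return " ".join(tokens[:m]), " ".join(tokens[m:])


def rows_from_text(text: str) -> list[tuple[str, str, str]]:
    """Pair each entry line with the number line that follows it."""
    rows = []
    vocab = pron = ""
    for line in map(str.strip, text.splitlines()):
        if not line:
            continue
        if NUMBER_LINE.fullmatch(line):
            rows.append((vocab, pron, " ".join(line.split())))
        else:
            vocab, pron = split_entry(line)
    return rows


sample = """do your best duː jɔː best
33, 81
do your hair/make-up duː jɔː heə/ ˈmeɪkʌp
81
don’t worry dəʊnt ˈwʌri
65
"""

# Tab-separated rows can be pasted into a spreadsheet as three columns.
for row in rows_from_text(sample):
    print("\t".join(row))
```

This heuristic misassigns an entry whose transcription begins with a token made only of ASCII letters, so the output is worth spot-checking before it goes into the worksheet.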
Comments
do/write an essay duː/raɪt æn ˈeseɪ into the phrase and pronunciation parts? Also, you're not accounting for the fact that the numbers are on separate lines.