I am a teacher and I want to use Python to create a worksheet for my students. I have a vocabulary PDF with content like this:

do your best duː jɔː best
33, 81
do your hair/make-up duː jɔː heə/ ˈmeɪkʌp
81
do/work overtime duː/wɜːk ˈəʊvətaɪm
36
do/write an essay duː/raɪt æn ˈeseɪ
33
document ˈdɒkjəmənt
54
documentary ˌdɒkjəˈmentəri
52
dollar ˈdɒlə
19
dolphin ˈdɒlfɪn
8
don’t worry dəʊnt ˈwʌri
65

I want to convert it into a table with three columns: vocab, API, and number that I can paste directly into Google Sheets, like this:

vocab                 API                   number
do your best          duː jɔː best          33, 81
do your hair/make-up  duː jɔː heə/ˈmeɪkʌp   81
do/work overtime      duː/wɜːk ˈəʊvətaɪm    36
...                   ...                   ...

I tried using the following Python code to extract text from the PDF and save it to a CSV:

import os
import csv
from pdfminer.high_level import extract_text

base_path = r"C:\Users\PC\OneDrive\Desktop\New folder"
pdf_file = os.path.join(base_path, "vocab.pdf")
csv_file = os.path.join(base_path, "vocab.csv")

text = extract_text(pdf_file)
lines = text.splitlines()

with open(csv_file, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for line in lines:
        if line.strip():  # skip blank lines
            writer.writerow([line.strip()])

print(f"Done! CSV file created at: {csv_file}")

However, this only produces a blank CSV file.

What I was expecting: I want the CSV to have three separate columns: vocab, API, and number with each entry properly aligned, so I can paste it directly into Google Sheets.

  • How do you expect to split a line such as "do/write an essay duː/raɪt æn ˈeseɪ" into the phrase and pronunciation parts? Also, you're not accounting for the fact that the numbers are on separate lines. Commented Sep 23, 2025 at 9:54
  • You need to convert the PDF file into plain text file first. Commented Sep 23, 2025 at 16:33

1 Answer

You need to account for two things that are peculiar to your data:

  1. The vocabulary / pronunciation parts are on separate lines from the numbers.
  2. You need to separate the vocabulary from the pronunciation.
import csv
from pathlib import Path
from pdfminer.high_level import extract_text

# pylint: disable=invalid-name
BASE = Path("~").expanduser()  # use HOME directory
pdf_in = BASE / "SO.pdf"
csv_out = BASE / "SO.csv"

with csv_out.open("w", encoding="utf-8", newline="") as csv_file:
    writer = csv.writer(csv_file)
    flag = False  # False: expecting an entry line, True: expecting a number line
    v, p = "", ""  # vocabulary and pronunciation parts
    for line in map(str.strip, extract_text(pdf_in).splitlines()):
        if line:
            if flag:
                # number line: replace multiple contiguous spaces with one space
                line = " ".join(line.split())
                writer.writerow([v, p, line])
            else:
                # entry line: split the tokens in the middle, which assumes
                # the vocabulary has as many tokens as the pronunciation
                # (an odd token out goes to the pronunciation)
                m = len(tokens := line.split()) // 2
                v = " ".join(tokens[:m])
                p = " ".join(tokens[m:])
            # non-blank lines alternate between entry and number
            flag = not flag
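
The strict entry/number alternation above gets out of step if the PDF yields a stray non-blank line. As a rough sketch of a slightly more defensive variant (not tested against your actual PDF), you could recognise number lines by their content instead. The names `NUMBER_LINE`, `split_entry` and `parse_vocab` are made up for illustration, and the middle-of-the-tokens split is still the same heuristic as above:

```python
import re

# Assumption: a number line holds only digits, commas and spaces, e.g. "33, 81".
NUMBER_LINE = re.compile(r"\d+(?:\s*,\s*\d+)*")


def split_entry(entry):
    """Split an entry line in the middle of its tokens (heuristic)."""
    tokens = entry.split()
    middle = len(tokens) // 2
    return " ".join(tokens[:middle]), " ".join(tokens[middle:])


def parse_vocab(text):
    """Return [vocab, pronunciation, numbers] rows from the extracted text."""
    rows = []
    entry = None  # most recent entry line still waiting for its numbers
    for line in map(str.strip, text.splitlines()):
        if not line:
            continue
        if NUMBER_LINE.fullmatch(line):
            if entry is not None:
                numbers = " ".join(line.split())  # collapse repeated spaces
                rows.append([*split_entry(entry), numbers])
                entry = None
        else:
            # Anything that is not a number line is treated as an entry;
            # if two entry lines arrive in a row, only the last one is kept.
            entry = line
    return rows
```

You would then pass `extract_text(pdf_in)` to `parse_vocab` and write the rows with `writer.writerows(...)`. If the goal is to paste straight into Google Sheets, creating the writer with `csv.writer(csv_file, delimiter="\t")` produces tab-separated output, which pastes into separate columns.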
answered Sep 23, 2025 at 10:29