How to format a vocabulary list into a table using Python for Google Sheets

Question 1

What are the details of your problem? I am a teacher and I want to use Python to create a worksheet for my students. I have a vocabulary PDF with content like this:

do your best duː jɔː best
33, 81
do your hair/make-up duː jɔː heə/ ˈmeɪkʌp
81
do/work overtime duː/wɜːk ˈəʊvətaɪm
36
do/write an essay duː/raɪt æn ˈeseɪ
33
document ˈdɒkjəmənt
54
documentary ˌdɒkjəˈmentəri
52
dollar ˈdɒlə
19
dolphin ˈdɒlfɪn
8
don’t worry dəʊnt ˈwʌri
65

I want to convert it into a table with three columns (vocab, API, and number) that I can paste directly into Google Sheets, like this:

vocab                   API                     number
do your best            duː jɔː best            33, 81
do your hair/make-up    duː jɔː heə/ˈmeɪkʌp     81
do/work overtime        duː/wɜːk ˈəʊvətaɪm      36
...                     ...                     ...

I tried using the following Python code to extract text from the PDF and save it to a CSV:

import os
import csv
from pdfminer.high_level import extract_text

base_path = r"C:\Users\PC\OneDrive\Desktop\New folder"
pdf_file = os.path.join(base_path, "vocab.pdf")
csv_file = os.path.join(base_path, "vocab.csv")

text = extract_text(pdf_file)
lines = text.splitlines()

with open(csv_file, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for line in lines:
        if line.strip():  # skip blank lines
            writer.writerow([line.strip()])

print(f"Done! CSV file created at: {csv_file}")

However, this only produces a blank CSV file.

What I was expecting: I want the CSV to have three separate columns: vocab, API, and number with each entry properly aligned, so I can paste it directly into Google Sheets.
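For the "paste directly into Google Sheets" part, one option is to write tab-separated text rather than comma-separated, since pasted tab-separated text is split into columns. The following is a minimal sketch under that assumption; the helper name rows_to_tsv and the sample rows are illustrative only and are not part of the original post.

```python
import csv
import io


def rows_to_tsv(rows):
    """Render (vocab, API, number) rows as tab-separated text with a header."""
    buf = io.StringIO()
    # A tab delimiter keeps "33, 81" in one cell without any quoting.
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["vocab", "API", "number"])
    writer.writerows(rows)
    return buf.getvalue()


# Illustrative rows taken from the sample data in the question.
sample = [
    ("do your best", "duː jɔː best", "33, 81"),
    ("do/work overtime", "duː/wɜːk ˈəʊvətaɪm", "36"),
]
tsv_text = rows_to_tsv(sample)
print(tsv_text)
```

The resulting text can be saved to a file with UTF-8 encoding or copied from the console and pasted into a sheet.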

Question 2

How do you expect to split a line such as "do/write an essay duː/raɪt æn ˈeseɪ" into the phrase and pronunciation parts? Also, you're not accounting for the fact that the numbers are on separate lines.
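One possible heuristic for the split asked about here, sketched under the assumption that the pronunciation begins at the first token containing a phonetic (non-ASCII) character, with curly quotes excluded so that spellings like "don’t" are not mistaken for phonetics. The names looks_phonetic and split_entry are hypothetical. It will guess wrong when the first pronunciation token happens to be plain ASCII, in which case it falls back to halving the tokens.

```python
# Typographic quotes that can occur in ordinary spelling, e.g. "don’t".
SPELLING_PUNCT = set("’‘“”")


def looks_phonetic(token):
    """True if the token has a non-ASCII character that is not a curly quote."""
    return any(ord(ch) > 127 and ch not in SPELLING_PUNCT for ch in token)


def split_entry(line):
    """Split 'phrase pronunciation' at the first phonetic-looking token."""
    tokens = line.split()
    for i, token in enumerate(tokens):
        # Never split before the first token: the phrase must be non-empty.
        if i > 0 and looks_phonetic(token):
            return " ".join(tokens[:i]), " ".join(tokens[i:])
    # Fallback guess: no phonetic-looking token found, so split in half.
    mid = len(tokens) // 2
    return " ".join(tokens[:mid]), " ".join(tokens[mid:])


for text in ("do/write an essay duː/raɪt æn ˈeseɪ", "don’t worry dəʊnt ˈwʌri"):
    print(split_entry(text))
```

Unlike a split at the midpoint, this does not require the phrase and the pronunciation to have the same number of tokens.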

Question 3

You need to convert the PDF file into a plain text file first.
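A rough sketch of that first step, assuming pdfminer.six is installed: it saves the extracted text to a file so it can be inspected before any parsing. The function names and file names are placeholders. If no lines come back, the PDF may have no text layer (for example, a scanned image), which would be one possible explanation for a blank CSV.

```python
from pathlib import Path


def non_blank_lines(text):
    """Return the stripped, non-empty lines of the extracted text."""
    return [s for s in map(str.strip, text.splitlines()) if s]


def pdf_to_txt(pdf_path, txt_path):
    """Extract text from a PDF, save it as UTF-8, and return its non-blank lines."""
    # Imported here so non_blank_lines works without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    text = extract_text(str(pdf_path))
    Path(txt_path).write_text(text, encoding="utf-8")
    return non_blank_lines(text)


# Hypothetical usage; the file names are placeholders:
# lines = pdf_to_txt("vocab.pdf", "vocab.txt")
# if not lines:
#     print("No text was extracted; the PDF may be a scanned image.")
```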

Question 4

You need to account for two things that are peculiar to your data:

1. The vocabulary/pronunciation parts are on separate lines from the numbers.
2. You need to isolate the vocabulary from the pronunciation.

import csv
from pathlib import Path
from pdfminer.high_level import extract_text

# pylint: disable=invalid-name
BASE = Path("~").expanduser()  # use HOME directory
pdf_in = BASE / "SO.pdf"
csv_out = BASE / "SO.csv"

with csv_out.open("w", encoding="utf-8", newline="") as _pdf:
    writer = csv.writer(_pdf)
    flag = False  # False: expecting a vocab/pronunciation line; True: a number line
    v, p = "", ""  # vocabulary and pronunciation parts
    for line in map(str.strip, extract_text(pdf_in).splitlines()):
        if line:
            if flag:
                # replace multiple contiguous spaces with one space
                line = " ".join(line.split())
                writer.writerow([v, p, line])
            else:
                # assumes an even number of tokens
                m = len(tokens := line.split()) // 2
                v = " ".join(tokens[:m])
                p = " ".join(tokens[m:])
            # non-blank lines alternate between entry text and numbers
            flag = not flag

jackal · Accepted Answer · 2025-09-23 10:29:20Z
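The accepted answer relies on non-blank lines strictly alternating between entry text and numbers. As a hedged variation (not the answerer's method), a number line could instead be recognised by its shape, which also tolerates an entry that wraps onto a second line. The regular expression and the name pair_entries below are assumptions made for illustration.

```python
import re

# A page-reference line such as "81" or "33, 81".
NUMBER_LINE = re.compile(r"\d+(\s*,\s*\d+)*")


def pair_entries(text):
    """Group extracted lines into (entry, numbers) pairs."""
    pairs, pending = [], []
    for line in map(str.strip, text.splitlines()):
        if not line:
            continue
        if NUMBER_LINE.fullmatch(line):
            # A number line closes the entry text collected so far.
            pairs.append((" ".join(pending), " ".join(line.split())))
            pending = []
        else:
            # Entry text; may span more than one line if it wrapped.
            pending.append(line)
    if pending:
        # Trailing entry with no number line after it.
        pairs.append((" ".join(pending), ""))
    return pairs


sample = """do your best duː jɔː best
33, 81
do your hair/make-up duː jɔː heə/ ˈmeɪkʌp
81
dollar ˈdɒlə
19
"""
for entry, numbers in pair_entries(sample):
    print(entry, "|", numbers)
```

Each entry string still has to be separated into its vocabulary and pronunciation parts before being written out as three columns.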
