I wrote a simple command-line script that counts the words in every document of a supported format in the current directory. It currently supports TXT, DOCX, XLSX, and PDF files and has been tested satisfactorily with all four. As a freelance translator and content writer, I find it an excellent tool for quickly evaluating the scope of large projects: I simply drop the script into a directory and run it from PowerShell/Terminal.
Currently tested in Windows 10 only.
What do you think of this script? What should I improve?
import os
import openpyxl
import platform
import docx2txt
import PyPDF2


def current_dir():
    if platform.system() == "Windows":
        directory = os.listdir(".\\")
    else:
        directory = os.getcwd()
    return directory


def excel_counter(filename):
    count = 0
    wb = openpyxl.load_workbook(filename)
    for sheet in wb:
        for row in sheet:
            for cell in row:
                text = str(cell.value)
                if text != "None":
                    word_list = text.split()
                    count += len(word_list)
    return count


def pdf_counter(filename):
    pdf_word_count = 0
    pdfFileObj = open(filename, "rb")
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    number_of_pages = pdfReader.getNumPages() - 1
    for page in range(0, number_of_pages + 1):
        page_contents = pdfReader.getPage(page - 1)
        raw_text = page_contents.extractText()
        text = raw_text.encode('utf-8')
        page_word_count = len(text.split())
        pdf_word_count += page_word_count
    return pdf_word_count


def main():
    word_count = 0
    print(f"Current Directory: {os.getcwd()}")
    for file in current_dir():
        file_name_list = os.path.splitext(file)
        extension = file_name_list[1]
        if extension == ".xlsx":
            current_count = excel_counter(file)
            print(f"{file} {current_count}")
            word_count += current_count
        if extension == ".docx":
            text = docx2txt.process(file)
            current_count = len(text.split())
            print(f"{file} {current_count}")
            word_count += current_count
        if extension == ".txt":
            f = open(file, "r")
            text = f.read()
            current_count = len(text.split())
            print(f"{file} {current_count}")
            word_count += current_count
        if extension == ".pdf":
            pdf_word_count = pdf_counter(file)
            print(f"{file} {pdf_word_count}")
            word_count += pdf_word_count
        else:
            pass
    print(f"Total: {word_count}")


main()
What do you consider to be a word? – AMC, Mar 13, 2020 at 22:50
2 Answers
I recommend using pathlib and Path objects instead of os, and you should use a context manager when manipulating files (e.g. with open("file.txt", "r") as file: ...). You also have a lot of repeated code when checking extensions, and you keep evaluating the remaining if statements even after an earlier one has matched. And the final else: pass does literally nothing, so just remove it.
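A minimal sketch of those two suggestions, restricted to plain-text files for brevity. The names count_txt_words and count_directory are hypothetical and not part of the original script, and UTF-8 input is an assumption:

```python
from pathlib import Path


def count_txt_words(path: Path) -> int:
    """Count whitespace-separated words in a text file."""
    # The context manager closes the file even if reading fails.
    with path.open("r", encoding="utf-8") as handle:
        return len(handle.read().split())


def count_directory(directory: Path) -> int:
    """Sum the word counts of all .txt files directly inside directory."""
    total = 0
    for path in sorted(directory.iterdir()):
        # suffix includes the leading dot; lower() also matches ".TXT".
        if path.is_file() and path.suffix.lower() == ".txt":
            count = count_txt_words(path)
            print(f"{path.name} {count}")
            total += count
    return total
```

Calling count_directory(Path.cwd()) would mirror the original "drop it in a directory and run it" workflow, and Path.iterdir() behaves the same on Windows and other platforms, so the platform check is no longer needed.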
You could also do something about your nested for loops for sheet, row and cell (you'd typically use zip or itertools.product then) but this is kind of readable and nice so I'm not sure it's worth the conversion.
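One caveat: zip and itertools.product do not quite fit this particular loop, because each inner iterable depends on the outer item. If you did want to flatten it, itertools.chain.from_iterable is probably the closer match. The sketch below uses plain nested lists as a stand-in for the workbook (sheets containing rows containing cell values), so iter_cells and count_words are illustrative names rather than openpyxl API:

```python
from itertools import chain


def iter_cells(sheets):
    """Return an iterator over every cell value in sheets -> rows -> cells."""
    rows = chain.from_iterable(sheets)   # flatten sheets into rows
    return chain.from_iterable(rows)     # flatten rows into cell values


def count_words(sheets):
    """Count whitespace-separated words over all non-empty cells."""
    return sum(
        len(str(value).split())
        for value in iter_cells(sheets)
        if value is not None
    )


# Stand-in data: two "sheets", each a list of rows of cell values.
workbook = [
    [["hello world", None], ["one", 42]],
    [["three more words"]],
]
print(count_words(workbook))
```

With a real openpyxl workbook the iterator would yield cell objects, so the value would be read from cell.value instead. Whether this is more readable than three nested for loops is a matter of taste.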
Refactor the code so each file type gets its own function for counting the words, like excel_counter() and pdf_counter(). Then use a dict to map file extensions to the functions.
Something like:
def docx_counter(file):
    text = docx2txt.process(file)
    return len(text.split())


def txt_counter(file):
    # Context manager so the file is always closed.
    with open(file, "r") as f:
        text = f.read()
    return len(text.split())


def unknown_counter(file):
    print(f"Don't know how to process {file}.")
    return 0


def main():
    word_count = 0
    print(f"Current Directory: {os.getcwd()}")
    # Map each supported extension to its counting function.
    counter = {
        ".xlsx": excel_counter,
        ".docx": docx_counter,
        ".txt": txt_counter,
        ".pdf": pdf_counter,
    }
    for file in current_dir():
        file_name_list = os.path.splitext(file)
        extension = file_name_list[1]
        # Fall back to unknown_counter for unsupported extensions.
        current_count = counter.get(extension, unknown_counter)(file)
        print(f"{file} {current_count}")
        word_count += current_count
    print(f"Total: {word_count}")
Redefining the counter dictionary for every file seems like a waste. Just move it out of the loop. And it seems like you renamed null_counter to unknown_counter, but forgot to update the counter.get line. – Graipher, Mar 14, 2020 at 11:05
@Graipher, You're absolutely correct; revised. It could be a global constant as well, or built dynamically by searching the module for functions with names like "*_counter". – RootTwo, Mar 14, 2020 at 17:02
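The "built dynamically" idea from the last comment could look roughly like this. It is only a sketch: build_counters is a hypothetical helper, and the two counters here take a string instead of a filename purely so the example runs without any third-party libraries:

```python
def txt_counter(text):
    """Hypothetical stand-in: count words in a plain string."""
    return len(text.split())


def csv_counter(text):
    """Hypothetical stand-in: count words in comma-separated text."""
    return len(text.replace(",", " ").split())


def unknown_counter(text):
    """Fallback for unsupported extensions."""
    return 0


def build_counters(namespace):
    """Map '.ext' to each callable named '<ext>_counter' in namespace."""
    suffix = "_counter"
    return {
        "." + name[: -len(suffix)]: func
        for name, func in namespace.items()
        if name.endswith(suffix)
        and callable(func)
        and func is not unknown_counter  # the fallback is not an extension
    }


# Built once at import time, i.e. a module-level constant.
COUNTERS = build_counters(globals())
print(sorted(COUNTERS))
print(COUNTERS.get(".txt", unknown_counter)("a few words"))
```

The trade-off is that adding a new format only requires defining a function with the right name, at the cost of the mapping being less explicit than a hand-written dict.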