3
\$\begingroup\$

I wrote a simple command-line script that counts the words in all documents of the supported formats in the current directory. It currently supports TXT, DOCX, XLSX, and PDF and has been tested satisfactorily with them. As a freelance translator and content writer, I find it an excellent tool for quickly evaluating the scope of large projects: I simply "drop" the script into a directory and run it from PowerShell/Terminal.

Currently tested in Windows 10 only.

What do you think of this script? What should I improve?

import os
import openpyxl
import platform
import docx2txt
import PyPDF2


def current_dir():
    if platform.system() == "Windows":
        directory = os.listdir(".\\")
    else:
        directory = os.getcwd()
    return directory


def excel_counter(filename):
    count = 0
    wb = openpyxl.load_workbook(filename)
    for sheet in wb:
        for row in sheet:
            for cell in row:
                text = str(cell.value)
                if text != "None":
                    word_list = text.split()
                    count += len(word_list)
    return count


def pdf_counter(filename):
    pdf_word_count = 0
    pdfFileObj = open(filename, "rb")
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    number_of_pages = pdfReader.getNumPages() - 1
    for page in range(0, number_of_pages + 1):
        page_contents = pdfReader.getPage(page - 1)
        raw_text = page_contents.extractText()
        text = raw_text.encode('utf-8')
        page_word_count = len(text.split())
        pdf_word_count += page_word_count
    return pdf_word_count


def main():
    word_count = 0
    print(f"Current Directory: {os.getcwd()}")
    for file in current_dir():
        file_name_list = os.path.splitext(file)
        extension = file_name_list[1]
        if extension == ".xlsx":
            current_count = excel_counter(file)
            print(f"{file} {current_count}")
            word_count += current_count
        if extension == ".docx":
            text = docx2txt.process(file)
            current_count = len(text.split())
            print(f"{file} {current_count}")
            word_count += current_count
        if extension == ".txt":
            f = open(file, "r")
            text = f.read()
            current_count = len(text.split())
            print(f"{file} {current_count}")
            word_count += current_count
        if extension == ".pdf":
            pdf_word_count = pdf_counter(file)
            print(f"{file} {pdf_word_count}")
            word_count += pdf_word_count
        else:
            pass
    print(f"Total: {word_count}")


main()
Ola Ström
asked Mar 13, 2020 at 16:48
\$\endgroup\$
1
  • \$\begingroup\$ What do you consider to be a word? \$\endgroup\$ Commented Mar 13, 2020 at 22:50

2 Answers

3
\$\begingroup\$

I recommend using pathlib and Path objects instead of os, and you should use a context manager when working with files (e.g. with open("file.txt", "r") as file: ...). You also repeat a lot of code when checking extensions, and you keep evaluating the remaining if statements even after an earlier one has matched; elif would avoid that. The final else: pass does nothing, so just remove it.
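As a rough illustration of the pathlib and context-manager suggestions, here is a minimal sketch that handles only .txt files. The names txt_counter and count_txt_words are hypothetical, and the UTF-8 encoding is an assumption; the original script relies on the platform default. Path.cwd().iterdir() behaves the same on every platform, so no platform check is needed.

```python
from pathlib import Path


def txt_counter(path: Path) -> int:
    """Count whitespace-separated words in a plain-text file."""
    # The with-block closes the file even if reading raises an exception.
    # encoding="utf-8" is an assumption, not something the question specifies.
    with path.open("r", encoding="utf-8") as handle:
        return len(handle.read().split())


def count_txt_words(directory: Path) -> dict:
    """Map each .txt file name in `directory` to its word count."""
    counts = {}
    for path in sorted(directory.iterdir()):
        # Path.suffix replaces os.path.splitext(file)[1].
        if path.is_file() and path.suffix.lower() == ".txt":
            counts[path.name] = txt_counter(path)
    return counts


if __name__ == "__main__":
    for name, count in count_txt_words(Path.cwd()).items():
        print(f"{name} {count}")
```

The other formats could be added the same way, with one function per extension.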

You could also do something about your nested for loops for sheet, row and cell (you'd typically use zip or itertools.product then) but this is kind of readable and nice so I'm not sure it's worth the conversion.
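For what it's worth, each inner loop here iterates over items produced by the outer loop, so itertools.chain.from_iterable is probably a closer fit than zip or itertools.product. The sketch below swaps in that technique and uses plain nested lists as a stand-in for the workbook (an assumption made to keep it runnable without openpyxl; with a real workbook the innermost items are cell objects and you would read cell.value). The name count_words_in_cells is hypothetical.

```python
from itertools import chain


def count_words_in_cells(workbook):
    """Count words in a workbook given as sheets -> rows -> cell values."""
    rows = chain.from_iterable(workbook)  # every row of every sheet
    cells = chain.from_iterable(rows)     # every cell value of every row
    # Skip empty cells instead of comparing str(value) with "None".
    return sum(len(str(value).split()) for value in cells if value is not None)


# Stand-in data: two "sheets", each a list of rows of cell values.
workbook = [
    [["hello world", None], ["foo", 42]],
    [["a b c"]],
]
print(count_words_in_cells(workbook))  # 7
```

Whether this reads better than three plain for loops is a matter of taste.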

answered Mar 13, 2020 at 20:15
\$\endgroup\$
0
3
\$\begingroup\$

Refactor the code so each file type gets its own function for counting the words, like excel_counter() and pdf_counter(). Then use a dict to map file extensions to the functions.

Something like:

import os

import docx2txt

# excel_counter(), pdf_counter() and current_dir() are the functions from the question.


def docx_counter(file):
    text = docx2txt.process(file)
    return len(text.split())


def txt_counter(file):
    # The context manager closes the file once it has been read.
    with open(file, "r") as f:
        text = f.read()
    return len(text.split())


def unknown_counter(file):
    print(f"Don't know how to process {file}.")
    return 0


def main():
    word_count = 0
    print(f"Current Directory: {os.getcwd()}")
    # Map each supported extension to the function that counts its words.
    counter = {
        ".xlsx": excel_counter,
        ".docx": docx_counter,
        ".txt": txt_counter,
        ".pdf": pdf_counter,
    }
    for file in current_dir():
        file_name_list = os.path.splitext(file)
        extension = file_name_list[1]
        # Fall back to unknown_counter for unsupported extensions.
        current_count = counter.get(extension, unknown_counter)(file)
        print(f"{file} {current_count}")
        word_count += current_count
    print(f"Total: {word_count}")
answered Mar 14, 2020 at 3:18
\$\endgroup\$
2
  • \$\begingroup\$ Redefining the counter dictionary for every file seems like a waste. Just move it out of the loop. And it seems like you renamed null_counter to unknown_counter, but forgot to update the counter.get line. \$\endgroup\$ Commented Mar 14, 2020 at 11:05
  • 1
    \$\begingroup\$ @Graipher, You're absolutely correct--revised. It could be a global constant as well, or built dynamically by searching the module for functions with names like "*_counter". \$\endgroup\$ Commented Mar 14, 2020 at 17:02
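
One possible reading of the "built dynamically" idea, as a sketch only: scan a namespace for callables whose names end in "_counter" and derive the extension from the name. The helper build_counters is hypothetical, and the sketch assumes the functions are named after their extensions (xlsx_counter rather than excel_counter); the xlsx_counter body is a placeholder.

```python
def txt_counter(file):
    """Count whitespace-separated words in a plain-text file."""
    with open(file, "r") as f:
        return len(f.read().split())


def xlsx_counter(file):
    # Placeholder: the real body would be excel_counter() from the question.
    raise NotImplementedError


def unknown_counter(file):
    print(f"Don't know how to process {file}.")
    return 0


def build_counters(namespace):
    """Map '.ext' to each callable named 'ext_counter' found in `namespace`."""
    suffix = "_counter"
    counters = {}
    for name, obj in list(namespace.items()):
        # unknown_counter is the fallback, not a handler for ".unknown" files.
        if callable(obj) and name.endswith(suffix) and name != "unknown_counter":
            counters["." + name[: -len(suffix)]] = obj
    return counters


# Built once at import time, so it also serves as the global constant.
COUNTERS = build_counters(globals())
print(sorted(COUNTERS))
```

The explicit dict is easier to follow; the dynamic version mainly pays off if new formats are added often.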
