Python word counter for all files in the current directory

Question 1

I wrote a simple command-line script that counts the words in every document of a supported format in the current directory. It currently supports TXT, DOCX, XLSX, and PDF files, and it has worked satisfactorily in my tests with all four. As a freelance translator and content writer, I use it to quickly evaluate the scope of large projects by simply "dropping" the script into a directory and running it from PowerShell/Terminal.

Currently tested in Windows 10 only.

What do you think of this script? What should I improve?

import os
import openpyxl
import platform
import docx2txt
import PyPDF2

def current_dir():
    if platform.system() == "Windows":
        directory = os.listdir(".\\")
    else:
        directory = os.getcwd()
    return directory

def excel_counter(filename):
    count = 0
    wb = openpyxl.load_workbook(filename)
    for sheet in wb:
        for row in sheet:
            for cell in row:
                text = str(cell.value)
                if text != "None":
                    word_list = text.split()
                    count += len(word_list)
    return count

def pdf_counter(filename):
    pdf_word_count = 0
    pdfFileObj = open(filename, "rb")
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    number_of_pages = pdfReader.getNumPages() - 1
    for page in range(0, number_of_pages + 1):
        page_contents = pdfReader.getPage(page - 1)
        raw_text = page_contents.extractText()
        text = raw_text.encode('utf-8')
        page_word_count = len(text.split())
        pdf_word_count += page_word_count
    return pdf_word_count

def main():
    word_count = 0
    print(f"Current Directory: {os.getcwd()}")
    for file in current_dir():
        file_name_list = os.path.splitext(file)
        extension = file_name_list[1]
        if extension == ".xlsx":
            current_count = excel_counter(file)
            print(f"{file} {current_count}")
            word_count += current_count
        if extension == ".docx":
            text = docx2txt.process(file)
            current_count = len(text.split())
            print(f"{file} {current_count}")
            word_count += current_count
        if extension == ".txt":
            f = open(file, "r")
            text = f.read()
            current_count = len(text.split())
            print(f"{file} {current_count}")
            word_count += current_count
        if extension == ".pdf":
            pdf_word_count = pdf_counter(file)
            print(f"{file} {pdf_word_count}")
            word_count += pdf_word_count
        else:
            pass
    print(f"Total: {word_count}")

main()

Question 2

What do you consider to be a word?
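
To make the point of this comment concrete, here is a small sketch showing how the total changes with the definition of "word". The sample sentence and both definitions are illustrative assumptions, not something taken from the script: the first matches what str.split() does in the question, the second treats every run of letters, digits, or underscores as a word.

```python
import re

# Illustrative sample text (an assumption, not from the original post).
sample = "Don't over-think it: state-of-the-art tools cost $3.50."

# Definition 1: whitespace-separated tokens, as str.split() produces.
whitespace_tokens = sample.split()

# Definition 2: runs of letters, digits, or underscores (regex \w+),
# which splits on apostrophes, hyphens, and punctuation.
regex_tokens = re.findall(r"\w+", sample)

print(len(whitespace_tokens), whitespace_tokens)
print(len(regex_tokens), regex_tokens)
```

The two definitions give 7 and 13 words for the same sentence, so it is worth deciding up front which one matches how the work is billed.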

ades (reputation 1,391; 7 silver badges, 16 bronze badges) · Answer 1 · 2020-03-13 20:15:59Z

I recommend using pathlib and Path objects instead of os, and you should use a context manager when manipulating files (e.g. with open("file.txt", "r") as file: ...). You also have a lot of repeating code when you're checking extensions, and you keep checking the rest of the if statements even if it's matched an earlier one. And the final else: pass does literally nothing so just remove that.
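
A minimal sketch of what the pathlib and context-manager suggestions could look like for the TXT case only. The helper name count_txt_files and the UTF-8 encoding are assumptions made for illustration; the original script does not specify an encoding.

```python
from pathlib import Path


def txt_counter(path: Path) -> int:
    """Count whitespace-separated words in a plain-text file."""
    # The with-block closes the file even if reading raises.
    # UTF-8 is an assumption; adjust it to match the actual files.
    with path.open("r", encoding="utf-8") as handle:
        return len(handle.read().split())


def count_txt_files(directory: Path) -> dict:
    """Map each .txt file name in `directory` to its word count."""
    counts = {}
    for path in sorted(directory.iterdir()):
        # Path.suffix replaces os.path.splitext(file)[1].
        if path.is_file() and path.suffix.lower() == ".txt":
            counts[path.name] = txt_counter(path)
    return counts
```

Calling count_txt_files(Path.cwd()) would cover the current directory on any platform, which also removes the need for the platform.system() check.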

You could also do something about your nested for loops for sheet, row and cell (you'd typically use zip or itertools.product then) but this is kind of readable and nice so I'm not sure it's worth the conversion.
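
For reference, one possible way to flatten the loops, sketched with plain nested lists standing in for a workbook so it runs without openpyxl. Because each inner loop iterates over the item produced by the outer one, itertools.chain.from_iterable is probably a closer fit here than zip or itertools.product, which suit loops over independent sequences. The sample data and the function name count_words are assumptions.

```python
from itertools import chain

# Stand-in for a workbook: a list of sheets, each a list of rows of cell values.
workbook = [
    [["hello world", None], ["one two three", 42]],
    [[None, "last sheet"]],
]


def count_words(sheets):
    # Each chain.from_iterable call removes one nesting level:
    # sheets -> rows, then rows -> cell values.
    rows = chain.from_iterable(sheets)
    cells = chain.from_iterable(rows)
    # Skip empty cells; convert everything else to text before splitting.
    return sum(len(str(value).split()) for value in cells if value is not None)


print(count_words(workbook))
```

Whether this reads better than three plain for loops is a matter of taste, which is the same reservation as above.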

RootTwo (reputation 10.6k; 1 gold badge, 14 silver badges, 30 bronze badges) · Answer 2 · 2020-03-14 03:18:58Z

Refactor the code so each file type gets its own function for counting the words, like excel_counter() and pdf_counter(). Then use a dict to map file extensions to the functions.

Something like:

def docx_counter(file):
    text = docx2txt.process(file)
    return len(text.split())

def txt_counter(file):
    # The context manager closes the file once the text has been read.
    with open(file, "r") as f:
        text = f.read()
    return len(text.split())

def unknown_counter(file):
    print(f"Don't know how to process {file}.")
    return 0

def main():
    word_count = 0
    print(f"Current Directory: {os.getcwd()}")
    # Map each supported extension to the function that counts its words.
    counter = {
        ".xlsx": excel_counter,
        ".docx": docx_counter,
        ".txt": txt_counter,
        ".pdf": pdf_counter,
    }
    for file in current_dir():
        file_name_list = os.path.splitext(file)
        extension = file_name_list[1]
        # Fall back to unknown_counter for unsupported extensions.
        current_count = counter.get(extension, unknown_counter)(file)
        print(f"{file} {current_count}")
        word_count += current_count
    print(f"Total: {word_count}")

Redefining the counter dictionary for every file seems like a waste. Just move it out of the loop. And it seems like you renamed null_counter to unknown_counter, but forgot to update the counter.get line.

@Graipher, You're absolutely correct--revised. It could be a global constant as well, or built dynamically by searching the module for functions with names like "*_counter".
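
A rough sketch of the "built dynamically" idea, assuming every counter function is named after the extension it handles. The helper build_counters and the placeholder function bodies are hypothetical and only stand in for the real counters.

```python
def txt_counter(file):
    return 0  # placeholder body for illustration


def pdf_counter(file):
    return 0  # placeholder body for illustration


def unknown_counter(file):
    print(f"Don't know how to process {file}.")
    return 0


def build_counters(namespace):
    """Map ".ext" to each callable named "<ext>_counter" in `namespace`."""
    suffix = "_counter"
    return {
        "." + name[: -len(suffix)]: obj
        # Copy the items so the namespace can change while we iterate.
        for name, obj in list(namespace.items())
        if name.endswith(suffix) and callable(obj) and name != "unknown_counter"
    }


# Built once at import time, so it also serves as the global constant.
COUNTERS = build_counters(globals())
print(sorted(COUNTERS))
```

One caveat with this approach: a function named excel_counter would be registered under ".excel", so it would have to be renamed xlsx_counter for the lookup by extension to work.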
