\$\begingroup\$

This Python script takes a directory of CSV files and calls a Scala script, which tests whether a file's contents match a given regular expression. The link to the Scala script is on the 3rd line of the code.

That script takes a single file as its argument, so I wrote this Python script to feed it an entire directory's worth of files while using as much of the available CPU power as possible.

Do you see any issues or potential improvements in my code?

"""
Command line API to CSV validator using Scala implementation from:
http://digital-preservation.github.io/csv-validator/#toc7
"""
PATH_TO_VALIDATOR = r"C:\prog\csv\csv-validator-cmd-1.2-RC2\bin\validate.bat"
PATH_TO_CSV_FOLDER = r"C:\prog\csv\CSVFiles"
PATH_TO_CSV_SCHEMA = r"C:\prog\csv\ocr-schema.csvs"
# Set defaults
CSV_ENCODING = "windows-1252"
CSV_SCHEMA_ENCODING = "UTF-8"
def open_csv(CSV_LIST):
    import subprocess
    # To be used to display a simple progress indicator
    TOTAL_FILE_COUNT = len(CSV_LIST)
    current_file_count = 1
    with open("output.txt", 'w') as output:
        for filename in CSV_LIST:
            print("Processing file " + str(current_file_count) + "/" + str(TOTAL_FILE_COUNT))
            output.write(filename + ': ')
            validator = subprocess.Popen(
                [PATH_TO_VALIDATOR, PATH_TO_CSV_FOLDER + "/" + filename, PATH_TO_CSV_SCHEMA, "--csv-encoding",
                 CSV_ENCODING, "--csv-schema-encoding", CSV_SCHEMA_ENCODING, '--fail-fast', 'true'], stdout=subprocess.PIPE)
            result = validator.stdout.read()
            output.write(result.decode('windows-1252'))
            current_file_count += 1
# Split a list into n sublists of roughly equal size
def split_list(alist, wanted_parts=1):
    length = len(alist)
    return [alist[i * length // wanted_parts: (i + 1) * length // wanted_parts]
            for i in range(wanted_parts)]
if __name__ == '__main__':
    import argparse
    import multiprocessing
    import os
    parser = argparse.ArgumentParser(description="Command line API to Scala CSV validator")
    parser.add_argument('-pv', '--PATH_TO_VALIDATOR', help="Specify the path to csv-validator-cmd/bin/validator.bat",
                        required=True)
    parser.add_argument('-pf', '--PATH_TO_CSV_FOLDER', help="Specify the path to the folder containing the csv files "
                                                            "you want to validate", required=True)
    parser.add_argument('-ps', '--PATH_TO_CSV_SCHEMA', help="Specify the path to CSV schema you want to use to "
                                                            "validate the given files", required=True)
    parser.add_argument('-cenc', '--CSV_ENCODING', help="Optional parameter to specify the encoding used by the CSV "
                                                        "files. Choose UTF-8 or windows-1252. Default windows-1252")
    parser.add_argument('-csenc', '--CSV_SCHEMA_ENCODING', help="Optional parameter to specify the encoding used by "
                                                                "the CSV Schema. Choose UTF-8 or windows-1252. "
                                                                "Default UTF-8")
    args = vars(parser.parse_args())
    if args['CSV_ENCODING'] is not None:
        CSV_ENCODING = args['CSV_ENCODING']
    if args['CSV_SCHEMA_ENCODING'] is not None:
        CSV_SCHEMA_ENCODING = args['CSV_SCHEMA_ENCODING']
    PATH_TO_VALIDATOR = args["PATH_TO_VALIDATOR"]
    PATH_TO_CSV_SCHEMA = args["PATH_TO_CSV_SCHEMA"]
    PATH_TO_CSV_FOLDER = args["PATH_TO_CSV_FOLDER"]
    CPU_COUNT = multiprocessing.cpu_count()
    split_csv_directory = split_list(os.listdir(args["PATH_TO_CSV_FOLDER"]), wanted_parts=CPU_COUNT)
    # Spawn a Process for each CPU on the system
    for csv_list in split_csv_directory:
        p = multiprocessing.Process(target=open_csv, args=(csv_list,))
        p.start()
Mast
asked Jan 27, 2018 at 18:09
\$\endgroup\$
  • \$\begingroup\$ Please do not update the code in your question to incorporate feedback from answers; doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers. Revising questions that have answers gets messy very fast. \$\endgroup\$ Commented Jan 29, 2018 at 11:27

1 Answer

\$\begingroup\$

Only a few small things I would suggest changing.

PEP 8

You should consider formatting your code in accordance with PEP 8. This is important when sharing code, as a consistent style makes it much easier for other programmers to read your code. There are various tools available to help make code PEP 8 compliant. I use the PyCharm IDE, which shows PEP 8 violations right in the editor.

ALL_CAPS_IS_FOR_CONSTS

By convention, ALL_CAPS names are reserved for constants. So this sort of thing:

TOTAL_FILE_COUNT = len(CSV_LIST)

is more Pythonic as:

total_file_count = len(CSV_LIST)

And even better is getting rid of this intermediate assignment altogether with:

print("Processing file {}/{}".format(current_file_count, len(CSV_LIST)))
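
The hand-maintained counter can go the same way, since enumerate can supply the running index. A minimal sketch, assuming a hypothetical helper named progress_lines and made-up file names; it only builds the progress messages and does not call the validator:

```python
def progress_lines(csv_list):
    """Return one progress message per file name, numbered from 1."""
    total = len(csv_list)
    return [
        # enumerate(..., start=1) replaces the manual current_file_count.
        "Processing file {}/{} ({})".format(index, total, filename)
        for index, filename in enumerate(csv_list, start=1)
    ]


if __name__ == "__main__":
    # Hypothetical file names, used only to demonstrate the numbering.
    for line in progress_lines(["a.csv", "b.csv", "c.csv"]):
        print(line)
```

In the original loop this would amount to iterating with enumerate(CSV_LIST, start=1) and dropping both the initial assignment and the += 1 line.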

Testing for presence in dicts

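You will often find code that checks whether a value is present in a dict before using it:

if args['CSV_ENCODING'] is not None:
    CSV_ENCODING = args['CSV_ENCODING']

When the key may be missing, the same thing can be done without the explicit check:

CSV_ENCODING = args.get('CSV_ENCODING', CSV_ENCODING)

Be aware that with argparse the key is always present (an omitted option is stored as None), so in this script .get() would return None rather than the fallback. Writing args['CSV_ENCODING'] or CSV_ENCODING keeps the fallback.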
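
Another way to avoid the check is to give argparse the fallback directly through default=, so that the parsed value is never None. A minimal sketch, assuming a hypothetical build_parser helper and hypothetical DEFAULT_* constant names; only the two encoding options from the question are reproduced, with shortened help texts:

```python
import argparse

# Hypothetical constant names; the values are the defaults from the question.
DEFAULT_CSV_ENCODING = "windows-1252"
DEFAULT_CSV_SCHEMA_ENCODING = "UTF-8"


def build_parser():
    """Build a parser whose encoding options fall back to the defaults."""
    parser = argparse.ArgumentParser(
        description="Command line API to Scala CSV validator")
    parser.add_argument('-cenc', '--CSV_ENCODING',
                        default=DEFAULT_CSV_ENCODING,
                        help="Encoding used by the CSV files")
    parser.add_argument('-csenc', '--CSV_SCHEMA_ENCODING',
                        default=DEFAULT_CSV_SCHEMA_ENCODING,
                        help="Encoding used by the CSV schema")
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list so the sketch runs without a real
    # command line; only the CSV encoding is overridden here.
    args = vars(build_parser().parse_args(['-cenc', 'UTF-8']))
    print(args['CSV_ENCODING'], args['CSV_SCHEMA_ENCODING'])
```

With the defaults declared on the options, the values can be read straight from args and the module-level fallbacks no longer need to be reassigned.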
answered Jan 29, 2018 at 6:15
\$\endgroup\$
