This program is supposed to find and slice any cell within a CSV file which contains more than a defined number of characters.
The files can be pretty large, so being a beginner I would like to know if it is written correctly, and if I could make it more efficient. One of the lines is also more than 80 characters long, which bothers me.
import configparser, csv, sys
if len(sys.argv) < 3 :
usage = """Usage: %s [inputfile][output file]\nThis program requires 2 \
arguments to function properly.\n[input file] is the file to clean\n[output fil\
e] is the name of the file that will be created as a result of this program\n"""
print(usage % (sys.argv[0]))
else :
#reads the config file
config = configparser.ConfigParser()
config.read('csv_cleaner.ini')
config = config['CONFIG']
encoding = config['character_encoding']
size = int(config['truncation_size'])
#opens target file and creates the receiving one
with open(sys.argv[1], 'r', newline='', encoding=encoding)as csv_file, \
open(sys.argv[2],'x', newline='', encoding=encoding)as output_file:
#helps with parsing
if config['detect_dialect'] :
dialect = csv.Sniffer().sniff(csv_file.read(2048))
dialect.escapechar = '\\'
#return to beginning of file
csv_file.seek(0)
#creates reader and writer
reader = csv.reader(csv_file, dialect)
dialect.delimiter = config['delimiter_in_output']
writer = csv.writer(output_file, dialect)
#loops through file's lines
for row in reader :
#slices cells and loops through line's columns
row=[col[:(size-2)]+(col[(size-2):]and'..')for col in row]
#writes in new file
writer.writerow(row)
This program uses a config file :
[CONFIG]
character_encoding = UTF-8
delimiter_in_output = ;
#set this option to False if the file is not recognized
detect_dialect = True
truncation_size = 255
1 Answer 1
The main improvement you can get is separating your concerns. Currently parsing the commandline, parsing the config file and actually truncating the columns is all mashed up together. Instead write short functions for each of these things:
import configparser
import csv
import sys
USAGE = """
Usage: %s [inputfile] [output file]
This program requires 2 arguments to function properly.
[input file] is the file to clean
[output file] is the name of the file that will be created as a result of this program
"""
def read_config(file_name):
config = configparser.ConfigParser()
config.read(file_name)
return config['CONFIG']
def detect_dialect_from_file(csv_file):
dialect = csv.Sniffer().sniff(csv_file.read(2048))
dialect.escapechar = '\\'
# return to beginning of file
csv_file.seek(0)
return dialect
def truncate_row(row, size):
return [col if len(col) <= size else col[:size - 2] + ".." for col in row]
def truncate_cells(csv_file_name, output_file_name, encoding, size,\
output_delimiter, detect_dialect, **kwargs):
# opens target file and creates the receiving one
with open(csv_file_name, 'r', newline='', encoding=encoding) as csv_file,\
open(output_file_name, 'x', newline='', encoding=encoding) as output_file:
# helps with parsing
dialect = (detect_dialect_from_file(csv_file)
if detect_dialect
else "excel")
reader = csv.reader(csv_file, dialect)
dialect.delimiter = output_delimiter
writer = csv.writer(output_file, dialect)
# loops through file's lines
for row in reader:
# writes in new file
writer.writerow(truncate_row(row, size))
def main():
print(len(sys.argv))
if len(sys.argv) < 3:
print(usage % (sys.argv[0]))
sys.exit(1)
# reads the config file
config = read_config('csv_cleaner.ini')
config['size'] = int(config['truncation_size'])
config['detect_dialect'] = bool(config['detect_dialect'])
truncate_cells(sys.argv[1], sys.argv[2], **config)
if __name__ == "__main__":
main()
Here I made the following additional changes:
- According to Python's official style-guide, PEP8, lines should be indented by 4 spaces. Also spaces should surround operators and keywords. Imports should also be on separate lines.
- While
[col[:(size-2)]+(col[(size-2):]and'..')for col in row]
works, I find[col if len(col) <= size else col[:size - 2] + ".." for col in row]
a bit more readable. - When
detect_dialect
is false in your case, the script should not run (since it is all nested under that if). But since the value is a string, it is truthy, regardless of whether or not you putTrue
orFalse
there. - When using a multiple line string using triple-quotes, there is no need to use explicit line-breaks and line-continuation characters. I made it a global constant here to avoid indentation issues.
Note that I (or rather my linter) think that 'x'
is not a valid file mode.
-
\$\begingroup\$ That's a lot of edit, thanks for those improvements. I still have a few questions. Why have all the imports on separated lines? I think it doesn't improve readability and reduces information density. Why put the usage variable out of the condition that displays it ? It forces the program to create a value that will perhaps not be used. \$\endgroup\$Comte_Zero– Comte_Zero2018年09月07日 11:05:23 +00:00Commented Sep 7, 2018 at 11:05
-
1\$\begingroup\$ @Comte_Zero: The authors of PEP 8 think it does improve readability. The difference is not that great for three imports but if you have
import argparse, pandas as pd, itertools, logging, numpy as np, os, pickle
all in one line, it gets messy. And that example is only the first half of imports in one of my scripts. \$\endgroup\$Graipher– Graipher2018年09月07日 11:11:47 +00:00Commented Sep 7, 2018 at 11:11