CSV file cell slicer

Question 1

This program is supposed to find and slice any cell within a CSV file which contains more than a defined number of characters.

The files can be pretty large, so being a beginner I would like to know if it is written correctly, and if I could make it more efficient. One of the lines is also more than 80 characters long, which bothers me.

import configparser, csv, sys
if len(sys.argv) < 3 :
 usage = """Usage: %s [inputfile][output file]\nThis program requires 2 \
arguments to function properly.\n[input file] is the file to clean\n[output fil\
e] is the name of the file that will be created as a result of this program\n"""
 print(usage % (sys.argv[0]))
else :
 #reads the config file
 config = configparser.ConfigParser()
 config.read('csv_cleaner.ini')
 config = config['CONFIG']
 encoding = config['character_encoding']
 size = int(config['truncation_size'])
 #opens target file and creates the receiving one
 with open(sys.argv[1], 'r', newline='', encoding=encoding)as csv_file, \
 open(sys.argv[2],'x', newline='', encoding=encoding)as output_file:
 #helps with parsing
 if config['detect_dialect'] :
 dialect = csv.Sniffer().sniff(csv_file.read(2048))
 dialect.escapechar = '\\'
 #return to beginning of file
 csv_file.seek(0)
 #creates reader and writer
 reader = csv.reader(csv_file, dialect)
 dialect.delimiter = config['delimiter_in_output']
 writer = csv.writer(output_file, dialect)
 #loops through file's lines
 for row in reader :
 #slices cells and loops through line's columns
 row=[col[:(size-2)]+(col[(size-2):]and'..')for col in row]
 #writes in new file
 writer.writerow(row)

This program uses a config file :

[CONFIG]
character_encoding = UTF-8
delimiter_in_output = ;
#set this option to False if the file is not recognized
detect_dialect = True
truncation_size = 255

Question 2

The main improvement you can get is separating your concerns. Currently parsing the commandline, parsing the config file and actually truncating the columns is all mashed up together. Instead write short functions for each of these things:

import configparser
import csv
import sys
USAGE = """
Usage: %s [inputfile] [output file]
This program requires 2 arguments to function properly.
[input file] is the file to clean
[output file] is the name of the file that will be created as a result of this program
"""
def read_config(file_name):
 config = configparser.ConfigParser()
 config.read(file_name)
 return config['CONFIG']
def detect_dialect_from_file(csv_file):
 dialect = csv.Sniffer().sniff(csv_file.read(2048))
 dialect.escapechar = '\\'
 # return to beginning of file
 csv_file.seek(0)
 return dialect
def truncate_row(row, size):
 return [col if len(col) <= size else col[:size - 2] + ".." for col in row]
def truncate_cells(csv_file_name, output_file_name, encoding, size,\
 output_delimiter, detect_dialect, **kwargs):
 # opens target file and creates the receiving one
 with open(csv_file_name, 'r', newline='', encoding=encoding) as csv_file,\
 open(output_file_name, 'x', newline='', encoding=encoding) as output_file:
 # helps with parsing
 dialect = (detect_dialect_from_file(csv_file)
 if detect_dialect
 else "excel")
 reader = csv.reader(csv_file, dialect)
 dialect.delimiter = output_delimiter
 writer = csv.writer(output_file, dialect)
 # loops through file's lines
 for row in reader:
 # writes in new file
 writer.writerow(truncate_row(row, size))
def main():
 print(len(sys.argv))
 if len(sys.argv) < 3:
 print(usage % (sys.argv[0]))
 sys.exit(1)
 # reads the config file
 config = read_config('csv_cleaner.ini')
 config['size'] = int(config['truncation_size'])
 config['detect_dialect'] = bool(config['detect_dialect'])
 truncate_cells(sys.argv[1], sys.argv[2], **config)
if __name__ == "__main__":
 main()

Here I made the following additional changes:

According to Python's official style-guide, PEP8, lines should be indented by 4 spaces. Also spaces should surround operators and keywords. Imports should also be on separate lines.
While [col[:(size-2)]+(col[(size-2):]and'..')for col in row] works, I find [col if len(col) <= size else col[:size - 2] + ".." for col in row] a bit more readable.
When detect_dialect is false in your case, the script should not run (since it is all nested under that if). But since the value is a string, it is truthy, regardless of whether or not you put True or False there.
When using a multiple line string using triple-quotes, there is no need to use explicit line-breaks and line-continuation characters. I made it a global constant here to avoid indentation issues.

Note that I (or rather my linter) think that 'x' is not a valid file mode.

Question 3

That's a lot of edit, thanks for those improvements. I still have a few questions. Why have all the imports on separated lines? I think it doesn't improve readability and reduces information density. Why put the usage variable out of the condition that displays it ? It forces the program to create a value that will perhaps not be used.

Question 4

@Comte_Zero: The authors of PEP 8 think it does improve readability. The difference is not that great for three imports but if you have import argparse, pandas as pd, itertools, logging, numpy as np, os, pickle all in one line, it gets messy. And that example is only the first half of imports in one of my scripts.

Graipher Graipher 41.6k7 gold badges70 silver badges134 bronze badges · Answer 1 · 2018-09-07 10:25:31Z

The main improvement you can get is separating your concerns. Currently parsing the commandline, parsing the config file and actually truncating the columns is all mashed up together. Instead write short functions for each of these things:

import configparser
import csv
import sys
USAGE = """
Usage: %s [inputfile] [output file]
This program requires 2 arguments to function properly.
[input file] is the file to clean
[output file] is the name of the file that will be created as a result of this program
"""
def read_config(file_name):
 config = configparser.ConfigParser()
 config.read(file_name)
 return config['CONFIG']
def detect_dialect_from_file(csv_file):
 dialect = csv.Sniffer().sniff(csv_file.read(2048))
 dialect.escapechar = '\\'
 # return to beginning of file
 csv_file.seek(0)
 return dialect
def truncate_row(row, size):
 return [col if len(col) <= size else col[:size - 2] + ".." for col in row]
def truncate_cells(csv_file_name, output_file_name, encoding, size,\
 output_delimiter, detect_dialect, **kwargs):
 # opens target file and creates the receiving one
 with open(csv_file_name, 'r', newline='', encoding=encoding) as csv_file,\
 open(output_file_name, 'x', newline='', encoding=encoding) as output_file:
 # helps with parsing
 dialect = (detect_dialect_from_file(csv_file)
 if detect_dialect
 else "excel")
 reader = csv.reader(csv_file, dialect)
 dialect.delimiter = output_delimiter
 writer = csv.writer(output_file, dialect)
 # loops through file's lines
 for row in reader:
 # writes in new file
 writer.writerow(truncate_row(row, size))
def main():
 print(len(sys.argv))
 if len(sys.argv) < 3:
 print(usage % (sys.argv[0]))
 sys.exit(1)
 # reads the config file
 config = read_config('csv_cleaner.ini')
 config['size'] = int(config['truncation_size'])
 config['detect_dialect'] = bool(config['detect_dialect'])
 truncate_cells(sys.argv[1], sys.argv[2], **config)
if __name__ == "__main__":
 main()

Here I made the following additional changes:

According to Python's official style-guide, PEP8, lines should be indented by 4 spaces. Also spaces should surround operators and keywords. Imports should also be on separate lines.
While [col[:(size-2)]+(col[(size-2):]and'..')for col in row] works, I find [col if len(col) <= size else col[:size - 2] + ".." for col in row] a bit more readable.
When detect_dialect is false in your case, the script should not run (since it is all nested under that if). But since the value is a string, it is truthy, regardless of whether or not you put True or False there.
When using a multiple line string using triple-quotes, there is no need to use explicit line-breaks and line-continuation characters. I made it a global constant here to avoid indentation issues.

Note that I (or rather my linter) think that 'x' is not a valid file mode.

That's a lot of edit, thanks for those improvements. I still have a few questions. Why have all the imports on separated lines? I think it doesn't improve readability and reduces information density. Why put the usage variable out of the condition that displays it ? It forces the program to create a value that will perhaps not be used.
@Comte_Zero: The authors of PEP 8 think it does improve readability. The difference is not that great for three imports but if you have import argparse, pandas as pd, itertools, logging, numpy as np, os, pickle all in one line, it gets messy. And that example is only the first half of imports in one of my scripts.

Stack Exchange Network

CSV file cell slicer

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

CSV file cell slicer

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions