I have a Python3 script (running on Python 3.5.2) which reads 110 CSV files (tab delimited) into a list. The largest file is 20 MB and the list ends up looking like this:
[
    [line1],
    [line2],
    ...
    [line2021756]
]
Right now the process takes about 32 seconds to complete:
python3 -m cProfile -s time script.py
110 files found
2021756 non-unique lines found.
9600828 function calls (9600673 primitive calls) in 32.900 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 30.451 30.451 32.890 32.890 script.py:81(main)
7419324 1.945 0.000 1.945 0.000 {method 'startswith' of 'str' objects}
2021994 0.189 0.000 0.189 0.000 {method 'append' of 'list' objects}
76528 0.164 0.000 0.164 0.000 {built-in method _codecs.utf_8_decode}
76528 0.130 0.000 0.294 0.000 codecs.py:318(decode)
110 0.007 0.000 0.007 0.000 {built-in method io.open}
5 0.002 0.000 0.002 0.000 {method 'read' of '_io.FileIO' objects}
3 0.001 0.000 0.001 0.000 {built-in method _imp.create_dynamic}
5 0.001 0.000 0.001 0.000 {built-in method marshal.loads}
1 0.001 0.001 0.001 0.001 {built-in method posix.listdir}
110 0.001 0.000 0.001 0.000 {built-in method _csv.reader}
48 0.001 0.000 0.002 0.000 <frozen importlib._bootstrap_external>:1215(find_spec)
111 0.001 0.000 0.001 0.000 {built-in method posix.lstat}
332 0.001 0.000 0.001 0.000 posixpath.py:71(join)
... and I would like to know if there is a way to significantly reduce that time. It seems startswith and append are the main bottlenecks.
script.py
# Find CSV files.
files_found = glob.glob('{0}dir_*_name/{1}'.format(input_dir, file_of_interest))
len_files_found = len(files_found)
if len_files_found == 0:
    print_message('Error: zero {0} files found'.format(file_of_interest), True)
print_message('{0} files found'.format(len_files_found), False)

# Read each file into files_found_lines.
# files_found_lines will look like [[line1],[line2],[line3],...]
files_found_lines = []
for file in files_found:
    try:
        # Open file for reading text.
        with open(file, 'rt', newline='', encoding='utf-8') as f:
            reader = csv.reader(f, delimiter='\t')
            for row in reader:
                # Keep lines starting with BLAH.
                if row[0].startswith('BLAH'):
                    # Get first 9 columns.
                    files_found_lines.append(row[0:9])
    except Exception as error:
        print_message('Error: {0}'.format(error), True)
It's possible I may end up reading each CSV file into its own list instead of one big list like above, so just FYI.
- I suggest that you tell us more about what the data look like, and what you intend to do with it once you have read it. – 200_success, May 7, 2017 at 22:46
- It may be prudent to look at other ways of reading CSVs in Python. For example, pandas' read_csv function is much faster than csv.reader. – Bryce Guinta, May 9, 2017 at 3:56
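To illustrate the pandas suggestion above, a minimal sketch might look like the following (assuming pandas is installed, every file has at least 9 columns, and input_dir and file_of_interest are as in the question):

import glob
import pandas as pd  # assumption: pandas is available

frames = []
for path in glob.glob('{0}dir_*_name/{1}'.format(input_dir, file_of_interest)):
    # Read only the first 9 columns as strings; header=None because the
    # files apparently have no header row.
    df = pd.read_csv(path, sep='\t', header=None, usecols=range(9),
                     dtype=str, encoding='utf-8')
    # Keep rows whose first column starts with BLAH.
    frames.append(df[df[0].str.startswith('BLAH', na=False)])

all_rows = pd.concat(frames, ignore_index=True)

Whether this beats the filtered csv.reader approach in the answer below depends on how many lines actually start with BLAH, since pandas still parses every line before the filter is applied.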
1 Answer
Since you only want lines starting with BLAH, filter the lines before parsing the CSV. I think the profiler is misleading you by attributing the time spent parsing CSV to the line for row in reader in your code.
with open(file, 'rt', newline='', encoding='utf-8') as f:
    filtered = (line for line in f if line.startswith('BLAH'))
    reader = csv.reader(filtered, delimiter='\t')
    for row in reader:
        files_found_lines.append(row[0:9])
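A sketch of how that drops into the original loop over files_found (print_message and the surrounding setup are unchanged from the question; using extend with a generator also removes the per-row append calls that show up in the profile):

files_found_lines = []
for file in files_found:
    try:
        with open(file, 'rt', newline='', encoding='utf-8') as f:
            # Only lines starting with BLAH ever reach the CSV parser.
            filtered = (line for line in f if line.startswith('BLAH'))
            reader = csv.reader(filtered, delimiter='\t')
            # One extend call per file instead of one append call per row.
            files_found_lines.extend(row[0:9] for row in reader)
    except Exception as error:
        print_message('Error: {0}'.format(error), True)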