I have a Python3 script (running on Python 3.5.2) which reads 110 CSV files (tab delimited) into a list. The largest file is 20 MB and the list ends up looking like this:
[
    [line1],
    [line2],
    ...
    [line2021756]
]
Right now the process takes about 32 seconds to complete:
python3 -m cProfile -s time script.py
110 files found
2021756 non-unique lines found.
9600828 function calls (9600673 primitive calls) in 32.900 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 30.451 30.451 32.890 32.890 script.py:81(main)
7419324 1.945 0.000 1.945 0.000 {method 'startswith' of 'str' objects}
2021994 0.189 0.000 0.189 0.000 {method 'append' of 'list' objects}
76528 0.164 0.000 0.164 0.000 {built-in method _codecs.utf_8_decode}
76528 0.130 0.000 0.294 0.000 codecs.py:318(decode)
110 0.007 0.000 0.007 0.000 {built-in method io.open}
5 0.002 0.000 0.002 0.000 {method 'read' of '_io.FileIO' objects}
3 0.001 0.000 0.001 0.000 {built-in method _imp.create_dynamic}
5 0.001 0.000 0.001 0.000 {built-in method marshal.loads}
1 0.001 0.001 0.001 0.001 {built-in method posix.listdir}
110 0.001 0.000 0.001 0.000 {built-in method _csv.reader}
48 0.001 0.000 0.002 0.000 <frozen importlib._bootstrap_external>:1215(find_spec)
111 0.001 0.000 0.001 0.000 {built-in method posix.lstat}
332 0.001 0.000 0.001 0.000 posixpath.py:71(join)
... and I would like to know if there is a way to significantly reduce that time. It seems startswith and append are the main bottlenecks.
script.py
# Find CSV files.
files_found = glob.glob('{0}dir_*_name/{1}'.format(input_dir, file_of_interest))
len_files_found = len(files_found)
if len_files_found == 0:
    print_message('Error: zero {0} files found'.format(file_of_interest), True)
print_message('{0} files found'.format(len_files_found), False)

# Read each file into files_found_lines.
# files_found_lines will look like [[line1],[line2],[line3],...]
files_found_lines = []
for file in files_found:
    try:
        # Open file for reading text.
        with open(file, 'rt', newline='', encoding='utf-8') as f:
            reader = csv.reader(f, delimiter='\t')
            for row in reader:
                # Keep lines starting with BLAH.
                if row[0].startswith('BLAH'):
                    # Get first 9 columns.
                    files_found_lines.append(row[0:9])
    except Exception as error:
        print_message('Error: {0}'.format(error), True)
It's possible I may end up reading each CSV file into its own list instead of one big list like above, so just FYI.
- I suggest that you tell us more about what the data look like, and what you intend to do with it once you have read it. – 200_success, May 7, 2017 at 22:46
- It may be prudent to look at other ways of reading CSVs in Python. For example, pandas' read_csv function is much faster than csv.reader. – Bryce Guinta, May 9, 2017 at 3:56
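To illustrate the pandas suggestion above, a minimal sketch might look like the following (assuming pandas is installed, every file has at least 9 columns, and input_dir and file_of_interest are as in the question):

import glob
import pandas as pd  # assumption: pandas is available

frames = []
for path in glob.glob('{0}dir_*_name/{1}'.format(input_dir, file_of_interest)):
    # Read only the first 9 columns as strings; header=None because the
    # files apparently have no header row.
    df = pd.read_csv(path, sep='\t', header=None, usecols=range(9),
                     dtype=str, encoding='utf-8')
    # Keep rows whose first column starts with BLAH.
    frames.append(df[df[0].str.startswith('BLAH', na=False)])

all_rows = pd.concat(frames, ignore_index=True)

Whether this beats the filtered csv.reader approach in the answer below depends on how many lines actually start with BLAH, since pandas still parses every line before the filter is applied.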
1 Answer
Since you only want lines starting with BLAH, filter the lines before parsing the CSV. I think the profiler is misleading you by attributing the time spent parsing CSV to the line for row in reader in your code.
with open(file, 'rt', newline='', encoding='utf-8') as f:
    filtered = (line for line in f if line.startswith('BLAH'))
    reader = csv.reader(filtered, delimiter='\t')
    for row in reader:
        files_found_lines.append(row[0:9])
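A sketch of how that drops into the original loop over files_found (print_message and the surrounding setup are unchanged from the question; using extend with a generator also removes the per-row append calls that show up in the profile):

files_found_lines = []
for file in files_found:
    try:
        with open(file, 'rt', newline='', encoding='utf-8') as f:
            # Only lines starting with BLAH ever reach the CSV parser.
            filtered = (line for line in f if line.startswith('BLAH'))
            reader = csv.reader(filtered, delimiter='\t')
            # One extend call per file instead of one append call per row.
            files_found_lines.extend(row[0:9] for row in reader)
    except Exception as error:
        print_message('Error: {0}'.format(error), True)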