I have been working on a project where I needed to analyze several large datasets, spread across many CSV files, at the same time. I am an engineer rather than a programmer, so I did a lot of searching and reading. Python's stock csv module provides the basic functionality, but I had a lot of trouble getting things to run quickly on 50k-500k rows, since many of the strategies I found simply append row by row. I struggled to get what I wanted, and I saw the same questions asked over and over again. So I decided to spend some time writing a portable class that performs these functions. If nothing else, my colleagues and I could use it.
I would like some input on the class and any suggestions you may have. I don't have any formal programming background, so this has been a good OOP intro for me. The end result is that in two lines you can read all CSV files in a folder into memory, either as pure Python lists or as lists of NumPy arrays. I have tested it in many scenarios and hopefully found most of the bugs. I'd like to think this is good enough that other people can just copy and paste it into their code and move on to the more important stuff. I am open to all critiques and suggestions. Is this something you could use? If not, why?
You can try it with generic CSV data. The standard Python lists are flexible in size and data type. The NumPy route only works with numeric (specifically float) data in a rectangular format:
x, y, z,
1, 2, 3,
4, 5, 6,
...
import numpy as np
import csv
import os
import sys
class EasyCSV(object):
    """Easily open from and save CSV files using lists or numpy arrays.

    Initiating and using the class is as easy as CSV = EasyCSV('location').
    The class takes the following arguments:
    EasyCSV(location, width=None, np_array='false', skip_rows=0)
    location is the only mandatory field and is string of the folder location
    containing .CSV file(s).
    width is optional and specifies a constant width. The default value None
    will return a list of lists with variable width. When used with numpy the
    array will have the dimensions of the first valid numeric row of data.
    np_array will create a fixed-width numpy array of only float values.
    skip_rows will skip the specified rows at the top of the file.
    """

    def __init__(self, location, width=None, np_array='false', skip_rows=0):
        # Initialize default variables
        self.np_array = np_array
        self.skip_rows = skip_rows
        self.loc = str(location)
        os.chdir(self.loc)
        self.dataFiles = []
        self.width = width
        self.i = 0
        #Find all CSV files in chosen directory.
        for files in os.listdir(loc):
            if files.endswith('CSV') or files.endswith('csv'):
                self.dataFiles.append(files)
        #Preallocate array to hold csv data later
        self.allData = [0] * len(self.dataFiles)

    def read(self,):
        '''Reads all files contained in the folder into memory.
        '''
        self.Dict = {} #Stores names of files for later lookup
        #Main processing loop
        for files in self.dataFiles:
            self.trim = 0
            self.j = 0
            with open(files,'rb') as self.rawFile:
                print files
                #Read in CSV object
                self.newData = csv.reader(self.rawFile)
                self.dataList = []
                #Extend iterates through CSV object and passes to datalist
                self.dataList.extend(self.newData)
                #Trims off pre specified lines at the top
                if self.skip_rows != 0:
                    self.dataList = self.dataList[self.skip_rows:]
                #Numpy route, requires all numeric input
                if self.np_array == 'true':
                    #Finds width if not specified
                    if self.width is None:
                        self.width = len(self.dataList[self.skip_rows])
                    self.CSVdata = np.zeros((len(self.dataList),self.width))
                    #Iterate through data and adds it to numpy array
                    self.k = 0
                    for data in self.dataList:
                        try:
                            self.CSVdata[self.j,:] = data
                            self.j+=1
                        except ValueError: #non numeric data
                            if self.width < len(data):
                                sys.exit('Numpy array too narrow. Choose another width')
                            self.trim+=1
                            pass
                        self.k+=1
                    #trims off excess
                    if not self.trim == 0:
                        self.CSVdata = self.CSVdata[:-self.trim]
                #Python nested lists route; tolerates multiple data types
                else:
                    #Declare required empty str arrays
                    self.CSVdata = [0]*len(self.dataList)
                    for rows in self.dataList:
                        self.k = 0
                        self.rows = rows
                        #Handle no width input, flexible width
                        if self.width is None:
                            self.numrow = [0]*len(self.rows)
                        else:
                            self.numrow = [0]*self.width
                        #Try to convert to float, fall back on string.
                        for data in self.rows:
                            try:
                                self.numrow[self.k] = float(data)
                            except ValueError:
                                try:
                                    self.numrow[self.k] = data
                                except IndexError:
                                    pass
                            except IndexError:
                                pass
                            self.k+=1
                        self.CSVdata[self.j] = self.numrow
                        self.j+=1
            #append file to allData which contains all files
            self.allData[self.i] = self.CSVdata
            #trim CSV off filename and store in Dict for indexing of allData
            self.dataFiles[self.i] = self.dataFiles[self.i][:-4]
            self.Dict[self.dataFiles[self.i]] = self.i
            self.i+=1

    def write(self, array, name, destination=None):
        '''Writes array in memory to file.

        EasyCSV.write(array, name, destination=None)
        array is a pointer to the array you want written to CSV
        name will be the name of said file
        destination is optional and will change the directory to the location
        specified. Leaving it at the default value None will overwrite any CSVs
        that may have been read in by the class earlier.
        '''
        self.array = array
        self.name = name
        self.dest = destination
        #Optional change directory
        if self.dest is not None:
            os.chdir(self.dest)
        #Dict does not hold CSV, check to see if present and trim
        if not self.name[-4:] == '.CSV' or self.name[-4:] == '.csv':
            self.name = name + '.CSV'
        #Create files and write data, 'wb' binary req'd for Win compatibility
        with open(self.name,'wb') as self.newCSV:
            self.CSVwrite = csv.writer(self.newCSV,dialect='excel')
            for data in self.array:
                self.CSVwrite.writerow(data)
        os.chdir(self.loc) #Change back to original __init__.loc

    def lookup(self, key=None):
        '''Prints a preview of data to the console window with just a key input
        '''
        self.key = key
        #Dict does not hold CSV, check to see if present and trim
        if self.key[-4:] == '.CSV' or self.key[-4:] == '.csv':
            self.key = key[:-4]
        #Print None case
        elif self.key is None:
            print self.allData[0]
            print self.allData[0]
            print '... ' * len(self.allData[0][-2])
            print self.allData[0][-2]
            print self.allData[0]
        #Print everything else
        else:
            self.index = self.Dict[self.key]
            print self.allData[self.index][0]
            print self.allData[self.index][1]
            print '... ' * len(self.allData[self.index][-2])
            print self.allData[self.index][-2]
            print self.allData[self.index][-1]

    def output(self, key=None):
        '''Returns the array for assignment to a var with just a key input
        '''
        self.key = key
        #Dict does not hold CSV, check to see if present and trim
        if self.key is None:
            return self.allData[0]
        elif self.key[-4:] == '.CSV' or self.key[-4:] == '.csv':
            self.key = key[:-4]
        #Return file requested
        self.index = self.Dict[self.key]
        return self.allData[self.Dict[self.key]]

################################################
loc = 'C:\Users\Me\Desktop'
CSV = EasyCSV(loc, np_array='false', width=None, skip_rows=0)
CSV.read()
target = 'somecsv' #with or without .csv/.CSV
CSV.lookup(target)
A = CSV.output(target)
loc2 = 'C:\Users\Me\Desktop\New folder'
for keys in CSV.Dict:
    print keys
    CSV.write(CSV.output(keys),keys,destination=loc2)
2 Answers
Some observations:
- You expect `read` to be called exactly once (otherwise it reads the same files again, right?). You might as well call it from `__init__` directly. Alternatively, `read` could take `location` as a parameter, so one could read multiple directories into the object.
- You use strings `'true'`, `'false'` where you should use actual `bool` values `True`, `False`.
- You set instance variables such as `self.key = key` that you use only locally inside the function, where you could simply use the local variable `key`.
- The `read` method is very long. Divide the work into smaller functions and call them from `read`.
- You have docstrings and a fair amount of comments, good. But then you have really cryptic statements such as `self.i = 0`.
- Some variable names are misleading, such as `files`, which is actually a single filename.
- Don't change the working directory (`os.chdir`). Use `os.path.join(loc, filename)` to construct paths. (If you think it's OK to change it, think about what happens if you combine this module with some other module that also thinks it's OK.)
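
Several of these points (local variables instead of `self.xyz`, descriptive names, small functions, and `os.path.join` instead of `os.chdir`) can be illustrated together. This is only a minimal sketch, written for Python 3 rather than the Python 2 of the question; `find_csv_files` and `read_rows` are hypothetical names, not part of the class under review:

```python
import csv
import os


def find_csv_files(directory):
    """Return the full paths of the .csv files in directory, sorted."""
    paths = []
    for filename in os.listdir(directory):
        # filename is a plain local variable; nothing is stored on self.
        if filename.lower().endswith('.csv'):
            # Build the full path instead of changing the working directory.
            paths.append(os.path.join(directory, filename))
    return sorted(paths)


def read_rows(path, skip_rows=0):
    """Return the rows of one CSV file as a list of lists of strings."""
    with open(path, newline='') as csv_file:
        rows = list(csv.reader(csv_file))
    # Drop the requested number of rows from the top of the file.
    return rows[skip_rows:]
```

A `read` method could then loop over `find_csv_files(directory)` and call `read_rows` for each path, which keeps each piece short enough to test on its own.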
This was exactly the sort of stuff I was looking for. Thanks for taking the time to look through it. You hit on a lot of the issues that came up during. Allowing for separate read paths would be really helpful. Also I need to research more on the local vars for a function. I was having problems getting them to work and found it easier to declare a self.xyz. Tan – devfarce, Apr 8, 2013 at 15:05
Janne's points are good. In addition:
When I try running this code, it fails:

    >>> e = EasyCSV('.')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "cr24836.py", line 37, in __init__
        for files in os.listdir(loc):
    NameError: global name 'loc' is not defined

I presume that `loc` is a typo for `self.loc`. This makes me suspicious. Have you actually used or tested this code?

The `width` and `skip_rows` arguments to the constructor apply to all CSV files in the directory. But isn't it likely that different CSV files will have different widths and need different numbers of rows to be skipped?

Your class requires NumPy to be installed (otherwise the line `import numpy as np` will fail). But since it has a mode of operation that doesn't require NumPy (return lists instead), it would be nice if it worked even if NumPy is not installed. Wait until you're just about to call `np.zeros` before importing NumPy.

`location` is supposed to be the name of a directory, so name it `directory`.

You write `self.key[-4:] == '.CSV'` but why not use `.endswith` like you did earlier in the program? Or better still, since you are testing this twice, write a function:

    def filename_is_csv(filename):
        """Return True if filename has the .csv extension."""
        _, ext = os.path.splitext(filename)
        return ext.lower() == '.csv'

But having said that, do you really want to insist that this can only read CSV files whose names end with `.csv`? What if someone has CSV stored in a file named `foo.data`? They'd never be able to read it with your class.

There's nothing in the documentation for the class that explains that I am supposed to call the `read()` method. (If I don't, nothing happens.)

There's nothing in the documentation for the class that explains how I am supposed to access the data that has been loaded into memory.

If I want to access the data for a filename, I have to look up the filename in the `Dict` attribute to get the index, and then I could look up the index in the `allData` attribute to get the data. Why this double lookup? Why not have a dictionary that maps filename to data instead of going via an index?

There is no need to preallocate arrays in Python. Wait to create the array until you have some data to put in it, and then `append` each entry to it. Python is not Fortran!

In your `read()` method, you read all the CSV files into memory. This seems wasteful. What if I had hundreds of files but only wanted to read one of them? Why not wait to read a file until the caller needs it?

You convert numeric elements to floating-point numbers. This might not be what I want. For example, if I have a file containing:

    Apollo,Launch
    7,19681011
    8,19681221
    9,19690303
    10,19690518
    11,19690716
    12,19691114
    13,19700411
    14,19710131
    15,19710726
    16,19720416
    17,19721207

and then I try to read it, all the data has been wrongly converted to floating-point:

    >>> e = EasyCSV('.')
    >>> e.read()
    apollo.csv
    >>> from pprint import pprint
    >>> pprint(e.allData[e.Dict['apollo']])
    [['Apollo', 'Launch'],
     [7.0, 19681011.0],
     [8.0, 19681221.0],
     [9.0, 19690303.0],
     [10.0, 19690518.0],
     [11.0, 19690716.0],
     [12.0, 19691114.0],
     [13.0, 19700411.0],
     [14.0, 19710131.0],
     [15.0, 19710726.0],
     [16.0, 19720416.0],
     [17.0, 19721207.0]]

This can go wrong in other ways. For example, suppose I have a CSV file like this:

    product code,inventory
    1a0,81
    7b4,61
    9c2,32
    8d3,90
    1e9,95
    2f4,71

When I read it with your class, look at what happens to the sixth row:

    >>> e = EasyCSV('.')
    >>> e.read()
    inventory.csv
    >>> pprint(e.allData[e.Dict['inventory']])
    [['product code', 'inventory'],
     ['1a0', 81.0],
     ['7b4', 61.0],
     ['9c2', 32.0],
     ['8d3', 90.0],
     [1000000000.0, 95.0],
     ['2f4', 71.0]]

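
One way to avoid this kind of silent conversion is to leave every cell as a string unless the caller says which columns are numeric. The following is only a minimal sketch, assuming Python 3; `convert_rows` and its `converters` argument are hypothetical names, not part of the class under review:

```python
def convert_rows(rows, converters, header_rows=1):
    """Apply converters[column_index] to each cell below the header rows.

    Columns that have no converter are kept as strings.
    """
    # Header rows are copied through unchanged.
    converted = [list(row) for row in rows[:header_rows]]
    for row in rows[header_rows:]:
        converted.append([
            converters.get(index, str)(cell)
            for index, cell in enumerate(row)
        ])
    return converted


rows = [['product code', 'inventory'], ['1e9', '95'], ['2f4', '71']]
print(convert_rows(rows, {1: int}))
# prints [['product code', 'inventory'], ['1e9', 95], ['2f4', 71]]
```

Here the product code `1e9` stays a string because only column 1 was given a converter, and the inventory counts come back as `int` rather than `float`.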
You suggest that "other people can just copy and paste into their code" but this is never a good idea. How would you distribute bug fixes and other improvements? If you plan for other people to use your code, you should aim to make a package that can be distributed through the Python Package Index.
In summary, your class is misnamed: it does not seem to me as if it would be easy to use in practice.
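
Some of the points above (one dictionary keyed by filename, reading a file only when it is requested, no preallocation, and importing NumPy only when it is needed) could be combined roughly as follows. This is a sketch under assumptions, written for Python 3; `CSVDirectory` and its method names are hypothetical and are not a drop-in replacement for `EasyCSV`:

```python
import csv
import os


class CSVDirectory(object):
    """Hypothetical sketch: load CSV files from one directory on demand."""

    def __init__(self, directory):
        self.directory = directory
        self._cache = {}  # maps filename directly to its list of rows

    def rows(self, filename):
        """Return the rows of filename, reading the file on first access."""
        if filename not in self._cache:
            path = os.path.join(self.directory, filename)
            with open(path, newline='') as csv_file:
                # list() consumes the reader; no preallocation is needed.
                self._cache[filename] = list(csv.reader(csv_file))
        return self._cache[filename]

    def array(self, filename, skip_rows=0):
        """Return the rows below skip_rows as a 2-D float NumPy array."""
        import numpy as np  # imported only when an array is requested
        return np.array(self.rows(filename)[skip_rows:], dtype=float)
```

With this shape, `CSVDirectory('.').rows('apollo.csv')` would read only that one file, and a caller who never asks for `array` never needs NumPy installed.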
`csvkit` and `pandas`, or maybe import CSVs into a relational or key-value database instead of using them directly.