Portable Python CSV class

Question 1

I have been working on a project where I needed to analyze several large datasets spread across many CSV files at the same time. I am an engineer, not a programmer, so I did a lot of searching and reading. Python's stock csv module provides the basic functionality, but I had a lot of trouble getting the methods to run quickly on 50k-500k rows, since many strategies simply append. I struggled to get what I wanted, and I saw the same questions asked over and over again. So I decided to spend some time writing a portable class that performs these functions. If nothing else, my colleagues and I could use it.

I would like some input on the class and any suggestions you may have. I have no formal programming background, so this has been a good OOP introduction for me. The end result is that, in two lines, you can read all the CSV files in a folder into memory as either pure Python lists or lists of NumPy arrays. I have tested it in many scenarios and hopefully found most of the bugs. I'd like to think this is good enough that other people can just copy and paste into their code and move on to the more important stuff. I am open to all critiques and suggestions. Is this something you could use? If not, why?

You can try it with generic CSV data. Standard Python lists are flexible in size and data type. NumPy will only work with numeric (specifically float) data in a rectangular format:

x, y, z,
1, 2, 3,
4, 5, 6,
...

```python
import numpy as np
import csv
import os
import sys

class EasyCSV(object):
    """Easily open from and save CSV files using lists or numpy arrays.
    Initiating and using the class is as easy as CSV = EasyCSV('location').
    The class takes the following arguments:
    EasyCSV(location, width=None, np_array='false', skip_rows=0)
    location is the only mandatory field and is a string of the folder location
    containing .CSV file(s).
    width is optional and specifies a constant width. The default value None
    will return a list of lists with variable width. When used with numpy the
    array will have the dimensions of the first valid numeric row of data.
    np_array will create a fixed-width numpy array of only float values.
    skip_rows will skip the specified rows at the top of the file.
    """
    def __init__(self, location, width=None, np_array='false', skip_rows=0):
        # Initialize default variables
        self.np_array = np_array
        self.skip_rows = skip_rows
        self.loc = str(location)
        os.chdir(self.loc)
        self.dataFiles = []
        self.width = width
        self.i = 0
        #Find all CSV files in chosen directory.
        for files in os.listdir(loc):
            if files.endswith('CSV') or files.endswith('csv'):
                self.dataFiles.append(files)
        #Preallocate array to hold csv data later
        self.allData = [0] * len(self.dataFiles)

    def read(self,):
        '''Reads all files contained in the folder into memory.
        '''
        self.Dict = {} #Stores names of files for later lookup
        #Main processing loop
        for files in self.dataFiles:
            self.trim = 0
            self.j = 0
            with open(files,'rb') as self.rawFile:
                print files
                #Read in CSV object
                self.newData = csv.reader(self.rawFile)
                self.dataList = []
                #Extend iterates through CSV object and passes to datalist
                self.dataList.extend(self.newData)
                #Trims off pre specified lines at the top
                if self.skip_rows != 0:
                    self.dataList = self.dataList[self.skip_rows:]
                #Numpy route, requires all numeric input
                if self.np_array == 'true':
                    #Finds width if not specified
                    if self.width is None:
                        self.width = len(self.dataList[self.skip_rows])
                    self.CSVdata = np.zeros((len(self.dataList),self.width))
                    #Iterate through data and adds it to numpy array
                    self.k = 0
                    for data in self.dataList:
                        try:
                            self.CSVdata[self.j,:] = data
                            self.j+=1
                        except ValueError: #non numeric data
                            if self.width < len(data):
                                sys.exit('Numpy array too narrow. Choose another width')
                            self.trim+=1
                            pass
                        self.k+=1
                    #trims off excess
                    if not self.trim == 0:
                        self.CSVdata = self.CSVdata[:-self.trim]
                #Python nested lists route; tolerates multiple data types
                else:
                    #Declare required empty str arrays
                    self.CSVdata = [0]*len(self.dataList)
                    for rows in self.dataList:
                        self.k = 0
                        self.rows = rows
                        #Handle no width input, flexible width
                        if self.width is None:
                            self.numrow = [0]*len(self.rows)
                        else:
                            self.numrow = [0]*self.width
                        #Try to convert to float, fall back on string.
                        for data in self.rows:
                            try:
                                self.numrow[self.k] = float(data)
                            except ValueError:
                                try:
                                    self.numrow[self.k] = data
                                except IndexError:
                                    pass
                            except IndexError:
                                pass
                            self.k+=1
                        self.CSVdata[self.j] = self.numrow
                        self.j+=1
            #append file to allData which contains all files
            self.allData[self.i] = self.CSVdata
            #trim CSV off filename and store in Dict for indexing of allData
            self.dataFiles[self.i] = self.dataFiles[self.i][:-4]
            self.Dict[self.dataFiles[self.i]] = self.i
            self.i+=1

    def write(self, array, name, destination=None):
        '''Writes array in memory to file.
        EasyCSV.write(array, name, destination=None)
        array is a pointer to the array you want written to CSV
        name will be the name of said file
        destination is optional and will change the directory to the location
        specified. Leaving it at the default value None will overwrite any CSVs
        that may have been read in by the class earlier.
        '''
        self.array = array
        self.name = name
        self.dest = destination
        #Optional change directory
        if self.dest is not None:
            os.chdir(self.dest)
        #Dict does not hold CSV, check to see if present and trim
        if not self.name[-4:] == '.CSV' or self.name[-4:] == '.csv':
            self.name = name + '.CSV'
        #Create files and write data, 'wb' binary req'd for Win compatibility
        with open(self.name,'wb') as self.newCSV:
            self.CSVwrite = csv.writer(self.newCSV,dialect='excel')
            for data in self.array:
                self.CSVwrite.writerow(data)
        os.chdir(self.loc) #Change back to original __init__.loc

    def lookup(self, key=None):
        '''Prints a preview of data to the console window with just a key input
        '''
        self.key = key
        #Dict does not hold CSV, check to see if present and trim
        if self.key[-4:] == '.CSV' or self.key[-4:] == '.csv':
            self.key = key[:-4]
        #Print None case
        elif self.key is None:
            print self.allData[0]
            print self.allData[0]
            print '... ' * len(self.allData[0][-2])
            print self.allData[0][-2]
            print self.allData[0]
        #Print everything else
        else:
            self.index = self.Dict[self.key]
            print self.allData[self.index][0]
            print self.allData[self.index][1]
            print '... ' * len(self.allData[self.index][-2])
            print self.allData[self.index][-2]
            print self.allData[self.index][-1]

    def output(self, key=None):
        '''Returns the array for assignment to a var with just a key input
        '''
        self.key = key
        #Dict does not hold CSV, check to see if present and trim
        if self.key is None:
            return self.allData[0]
        elif self.key[-4:] == '.CSV' or self.key[-4:] == '.csv':
            self.key = key[:-4]
        #Return file requested
        self.index = self.Dict[self.key]
        return self.allData[self.Dict[self.key]]

################################################
loc = 'C:\Users\Me\Desktop'
CSV = EasyCSV(loc, np_array='false', width=None, skip_rows=0)
CSV.read()
target = 'somecsv' #with or without .csv/.CSV
CSV.lookup(target)
A = CSV.output(target)
loc2 = 'C:\Users\Me\Desktop\New folder'
for keys in CSV.Dict:
    print keys
    CSV.write(CSV.output(keys),keys,destination=loc2)
```
Comment 1

Also, you should investigate csvkit and pandas, or maybe import the CSVs into a relational or key-value database instead of using them directly.

Comment 2

Pandas is interesting but not very lightweight. I wish I had seen a suggestion for it before. I can't use csvkit because I'm primarily on Windows. Thanks for the useful info.

Comment 3

csvkit is supported on all platforms.

Comment 4

Please use comments to respond to people rather than editing your question.

Janne Karila · Answer 1 · 2013-04-08 11:31:51Z

Some observations:

You expect read to be called exactly once (otherwise it reads the same files again, right?). You might as well call it from __init__ directly. Alternatively, read could take location as parameter, so one could read multiple directories into the object.
You use strings 'true', 'false' where you should use actual bool values True, False
You set instance variables such as self.key = key that you use only locally inside the function, where you could simply use the local variable key.
The read method is very long. Divide the work into smaller functions and call them from read.
You have docstrings and a fair amount of comments, good. But then you have really cryptic statements such as self.i = 0.
Some variable names are misleading, such as files which is actually a single filename.
Don't change the working directory (os.chdir). Use os.path.join(loc, filename) to construct paths. (If you think it's OK to change it, think what happens if you combine this module with some other module that also thinks it's OK)
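
A minimal sketch of three of these points, written for Python 3 rather than the Python 2 of the question: paths are built with os.path.join instead of os.chdir, the flag is a real bool, and everything lives in local variables. The names find_csv_files, read_rows and as_float are hypothetical, chosen only for illustration.

```python
import csv
import os


def find_csv_files(directory):
    """Return full paths of the .csv files in directory, without chdir."""
    paths = []
    for filename in sorted(os.listdir(directory)):
        if filename.lower().endswith('.csv'):
            # Join instead of changing the process-wide working directory.
            paths.append(os.path.join(directory, filename))
    return paths


def read_rows(path, skip_rows=0, as_float=False):
    """Read one CSV file into a list of rows.

    as_float is a real bool rather than the string 'true' or 'false'.
    """
    with open(path, newline='') as csv_file:
        rows = list(csv.reader(csv_file))[skip_rows:]
    if as_float:
        # Raises ValueError on non-numeric cells instead of hiding them.
        rows = [[float(cell) for cell in row] for row in rows]
    return rows
```

Because nothing here touches the working directory, the functions can be combined with other modules that rely on relative paths.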

This was exactly the sort of stuff I was looking for. Thanks for taking the time to look through it. You hit on a lot of the issues that came up during. Allowing for separate read paths would be really helpful. Also I need to research more on the local vars for a function. I was having problems getting them to work and found it easier to declare a self.xyz. Tan

score 5 · Answer 2 · 2013-04-08 15:56:02Z

Janne's points are good. In addition:

When I try running this code, it fails:

>>> e = EasyCSV('.')
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "cr24836.py", line 37, in __init__
 for files in os.listdir(loc):
NameError: global name 'loc' is not defined

I presume that loc is a typo for self.loc. This makes me suspicious. Have you actually used or tested this code?

The width and skip_rows arguments to the constructor apply to all CSV files in the directory. But isn't it likely that different CSV files will have different widths and need different numbers of rows to be skipped?
Your class requires NumPy to be installed (otherwise the line import numpy as np will fail). But since it has a mode of operation that doesn't require NumPy (return lists instead), it would be nice if it worked even if NumPy is not installed. Wait until you're just about to call np.zeros before importing NumPy.
location is supposed to be the name of a directory, so name it directory.
You write self.key[-4:] == '.CSV' but why not use .endswith like you did earlier in the program? Or better still, since you are testing this twice, write a function:
```
def filename_is_csv(filename):
 """Return True if filename has the .csv extension."""
 _, ext = os.path.splitext(filename)
 return ext.lower() == '.csv'
```
But having said that, do you really want to insist that this can only read CSV files whose names end with .csv? What if someone has CSV stored in a file named foo.data? They'd never be able to read it with your class.
There's nothing in the documentation for the class that explains that I am supposed to call the read() method. (If I don't, nothing happens.)
There's nothing in the documentation for the class that explains how I am supposed to access the data that has been loaded into memory.
If I want to access the data for a filename, I have to look up the filename in the Dict attribute to get the index, and then look up the index in the allData attribute to get the data. Why this double lookup? Why not have a dictionary that maps filename to data instead of going via an index?
There is no need to preallocate arrays in Python. Wait to create the array until you have some data to put in it, and then append each entry to it. Python is not Fortran!
In your read() method, you read all the CSV files into memory. This seems wasteful. What if I had hundreds of files but only wanted to read one of them? Why not wait to read a file until the caller needs it?
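
To make the last three points concrete, here is one possible shape for it, sketched in Python 3 under my own assumptions: a single dictionary maps each filename straight to its rows, a file is read only when someone asks for it, and the row list grows by appending. The class name CSVDirectory and the method rows are hypothetical.

```python
import csv
import os


class CSVDirectory(object):
    """Load CSV files from a directory on demand (hypothetical sketch)."""

    def __init__(self, directory):
        self.directory = directory
        self._cache = {}  # maps filename directly to its rows

    def rows(self, filename):
        """Return the rows of filename, reading the file on first use."""
        if filename not in self._cache:
            path = os.path.join(self.directory, filename)
            rows = []
            with open(path, newline='') as csv_file:
                for row in csv.reader(csv_file):
                    rows.append(row)  # grow the list; no preallocation
            self._cache[filename] = rows
        return self._cache[filename]
```

With this layout, CSVDirectory('.').rows('apollo.csv') would read just that one file, and a second call would return the cached list.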

You convert numeric elements to floating-point numbers. This might not be what I want. For example, if I have a file containing:

Apollo,Launch
7,19681011
8,19681221
9,19690303
10,19690518
11,19690716
12,19691114
13,19700411
14,19710131
15,19710726
16,19720416
17,19721207

and then I try to read it, all the data has been wrongly converted to floating-point:

>>> e = EasyCSV('.')
>>> e.read()
apollo.csv
>>> from pprint import pprint
>>> pprint(e.allData[e.Dict['apollo']])
[['Apollo', 'Launch'],
 [7.0, 19681011.0],
 [8.0, 19681221.0],
 [9.0, 19690303.0],
 [10.0, 19690518.0],
 [11.0, 19690716.0],
 [12.0, 19691114.0],
 [13.0, 19700411.0],
 [14.0, 19710131.0],
 [15.0, 19710726.0],
 [16.0, 19720416.0],
 [17.0, 19721207.0]]

This can go wrong in other ways. For example, suppose I have a CSV file like this:

product code,inventory
1a0,81
7b4,61
9c2,32
8d3,90
1e9,95
2f4,71

When I read it with your class, look at what happens to the sixth row:

>>> e = EasyCSV('.')
>>> e.read()
inventory.csv
>>> pprint(e.allData[e.Dict['inventory']])
[['product code', 'inventory'],
 ['1a0', 81.0],
 ['7b4', 61.0],
 ['9c2', 32.0],
 ['8d3', 90.0],
 [1000000000.0, 95.0],
 ['2f4', 71.0]]
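
The '1e9' surprise happens because float() accepts scientific notation. One way around both problems, sketched here in Python 3, is to leave every cell as a string unless the caller names a converter for that column. The helper convert_rows and its converters parameter are hypothetical names, not part of the class under review.

```python
def convert_rows(rows, converters=None):
    """Apply per-column converters to rows; all other cells stay strings.

    converters maps a column index to a callable such as int or float.
    """
    converters = converters or {}
    converted = []
    for row in rows:
        converted.append([
            converters[index](cell) if index in converters else cell
            for index, cell in enumerate(row)
        ])
    return converted


# float() parses scientific notation, which is why '1e9' was converted.
assert float('1e9') == 1000000000.0

inventory = [['1a0', '81'], ['1e9', '95']]
# Only column 1 is converted; the product codes are left untouched.
print(convert_rows(inventory, converters={1: int}))
# prints [['1a0', 81], ['1e9', 95]]
```

The header row would need to be skipped or handled separately before passing rows to a converter like int.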

You suggest that "other people can just copy and paste into their code" but this is never a good idea. How would you distribute bug fixes and other improvements? If you plan for other people to use your code, you should aim to make a package that can be distributed through the Python Package Index.

In summary, your class is misnamed: it does not seem to me as if it would be easy to use in practice.
