CSV parsing in Python

Question 1

I want to parse a csv file which is in the following format:

Test Environment INFO for 1 line.
Test,TestName1,
TestAttribute1-1,TestAttribute1-2,TestAttribute1-3
TestAttributeValue1-1,TestAttributeValue1-2,TestAttributeValue1-3
Test,TestName2,
TestAttribute2-1,TestAttribute2-2,TestAttribute2-3
TestAttributeValue2-1,TestAttributeValue2-2,TestAttributeValue2-3
Test,TestName3,
TestAttribute3-1,TestAttribute3-2,TestAttribute3-3
TestAttributeValue3-1,TestAttributeValue3-2,TestAttributeValue3-3
Test,TestName4,
TestAttribute4-1,TestAttribute4-2,TestAttribute4-3
TestAttributeValue4-1-1,TestAttributeValue4-1-2,TestAttributeValue4-1-3
TestAttributeValue4-2-1,TestAttributeValue4-2-2,TestAttributeValue4-2-3
TestAttributeValue4-3-1,TestAttributeValue4-3-2,TestAttributeValue4-3-3

and would like to turn this into a tab-separated format like the following:

TestName1
TestAttribute1-1 TestAttributeValue1-1
TestAttribute1-2 TestAttributeValue1-2
TestAttribute1-3 TestAttributeValue1-3
TestName2
TestAttribute2-1 TestAttributeValue2-1
TestAttribute2-2 TestAttributeValue2-2
TestAttribute2-3 TestAttributeValue2-3
TestName3
TestAttribute3-1 TestAttributeValue3-1
TestAttribute3-2 TestAttributeValue3-2
TestAttribute3-3 TestAttributeValue3-3
TestName4
TestAttribute4-1 TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2 TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3 TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3

The number of TestAttributes varies from test to test: some tests have only 3 values, others 7, and so on. Also, as in the TestName4 example, some tests are executed more than once, and each execution has its own TestAttributeValue line (in the example, TestName4 is executed 3 times, hence the 3 value lines).

I am new to Python and do not know much yet, but I would like to parse the CSV file with Python. I looked at Python's csv library and could not tell whether it will be enough for me or whether I should write my own string parser. Could you please help me?

Best

Question 2

Did you actually try the csv module? Did it work? If not, what didn't work?

Question 3

Using csv.reader with the parameter delimiter set to "," will allow you to retrieve the content of the file as lists of strings. From there you'll need to reformat the whole structure.
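As a rough illustration of that first step, here is a minimal sketch. The two-line sample string is copied from the question's format, and io.StringIO stands in for the real file, which is an assumption made only to keep the example self-contained:

```python
import csv
import io

# Sample lines in the question's format (a stand-in for the real file).
sample = "Test,TestName1,\nTestAttribute1-1,TestAttribute1-2,TestAttribute1-3\n"

# csv.reader accepts any iterable of lines; each row comes back as a list of strings.
rows = list(csv.reader(io.StringIO(sample), delimiter=","))

for row in rows:
    print(row)
```

Note that the trailing comma on the "Test" line produces an empty third field, which is why the reformatting still has to be done afterwards.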

Question 4

@LutzHorn Actually I have not been able to look at the csv module in detail yet; I hope I will have time in a few hours. As far as I understand, though, in my case it is only useful for separating the text at each ",". So I wondered what the use of the csv module is, since I can do that myself with a simple text parser that checks whether a "," exists. I am curious whether the csv module can do more for my case than find "," and separate the values. I do not know if I am looking for magic :)

Question 5

CSV could also be named DSV: Delimiter Separated Values. The delimiter could also be whitespace. You should 1) find a way to split your input in blocks, and 2) parse these blocks as CSV.
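One possible way to sketch those two steps is shown below. The helper name split_blocks and the shortened sample data are hypothetical, and the sketch assumes that a new block starts at every row whose first field is exactly "Test" and that the info line has already been skipped:

```python
import csv
import io

def split_blocks(rows, marker="Test"):
    """Yield lists of rows, starting a new block at each row whose first field is `marker`."""
    block = []
    for row in rows:
        # A marker row closes the block collected so far (if there is one).
        if row and row[0] == marker and block:
            yield block
            block = []
        # Ignore completely empty rows, keep everything else.
        if row:
            block.append(row)
    if block:
        yield block

# Shortened, made-up sample in the question's layout (info line already removed).
sample = (
    "Test,TestName1,\n"
    "A1,A2\n"
    "V1,V2\n"
    "Test,TestName2,\n"
    "B1,B2\n"
    "W1,W2\n"
)

blocks = list(split_blocks(csv.reader(io.StringIO(sample))))
for block in blocks:
    print(block[0][1], block[1:])
```

Each block then holds the header row first, followed by the attribute row and the value rows.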

Question 6

I'd use a solution using the itertools.groupby function and the csv module. Please have a close look at the documentation of itertools -- you can use it more often than you think!

I've used blank lines to differentiate the datasets, and this approach uses lazy evaluation, storing only one dataset in memory at a time:

import csv
from itertools import groupby

# Python 3: open both files in text mode with newline='', as the csv docs recommend.
with open('my_data.csv', newline='') as ifile, \
        open('my_out_data.csv', 'w', newline='') as ofile:
    # Use the csv module to handle reading and writing of delimited files.
    reader = csv.reader(ifile)
    writer = csv.writer(ofile, delimiter='\t')
    # Skip info line
    next(reader)
    # Group datasets by the condition if len(row) > 0 or not, then filter
    # out all empty lines
    for group in (v for k, v in groupby(reader, lambda x: bool(len(x))) if k):
        test_data = list(group)
        # Write header
        writer.writerow([test_data[0][1]])
        # Write transposed data
        writer.writerows(zip(*test_data[1:]))
        # Write blank line
        writer.writerow([])

Output, given that the supplied data is stored in my_data.csv:

TestName1
TestAttribute1-1 TestAttributeValue1-1
TestAttribute1-2 TestAttributeValue1-2
TestAttribute1-3 TestAttributeValue1-3
TestName2
TestAttribute2-1 TestAttributeValue2-1
TestAttribute2-2 TestAttributeValue2-2
TestAttribute2-3 TestAttributeValue2-3
TestName3
TestAttribute3-1 TestAttributeValue3-1
TestAttribute3-2 TestAttributeValue3-2
TestAttribute3-3 TestAttributeValue3-3
TestName4
TestAttribute4-1 TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2 TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3 TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3
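To make the groupby step easier to follow, here is a small in-memory sketch. The rows are made up, and the empty list stands for a blank line between datasets, which is what the answer assumes the input contains:

```python
from itertools import groupby

# Parsed rows as csv.reader would return them; [] represents a blank line.
rows = [
    ["Test", "TestName1", ""],
    ["A1", "A2"],
    ["V1", "V2"],
    [],
    ["Test", "TestName2", ""],
    ["B1", "B2"],
    ["W1", "W2"],
]

# bool(row) is False for an empty row, so each run of consecutive non-empty
# rows forms one group; the groups made of blank rows are dropped.
datasets = [list(group) for has_data, group in groupby(rows, key=bool) if has_data]

for dataset in datasets:
    print(dataset[0][1], dataset[1:])
```

If the input has no blank lines between datasets, every row lands in a single group, so the grouping key would have to change.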

Question 7

The following does what you want, and only reads up to one section at a time (saves memory for a large file). Replace in_path and out_path with the input and output file paths respectively:

import csv

def print_section(section, f_out):
    if len(section) > 0:
        # find the length of the longest row in the section
        max_len = max(len(col) for col in section)
        # build and print each row
        for i in range(max_len):
            f_out.write('\t'.join([col[i] if len(col) > i else '' for col in section]) + '\n')
        f_out.write('\n')

# csv.reader is not a context manager, so open the file first and wrap it
with open(in_path, 'r') as f_raw, open(out_path, 'w') as f_out:
    f_in = csv.reader(f_raw)
    # skip the info line
    next(f_in)
    section = []
    for line in f_in:
        # test for new "Test" section
        if len(line) == 3 and line[0] == 'Test' and line[2] == '':
            # write previous section data
            print_section(section, f_out)
            # reset section
            section = []
            # write new section header
            f_out.write(line[1] + '\n')
        else:
            # add line to section
            section.append(line)
    # print the last section
    print_section(section, f_out)

Note that you'll want to change 'Test' in the line[0] == 'Test' statement to the correct word for indicating the header line.

The basic idea here is that we read each section into a list of lists, then write that list of lists back out using a list comprehension to transpose it (adding blank elements when the columns are uneven).

Question 8

a) Use the csv module when dealing with delimited files, and b) to transpose a matrix, use zip(*iterable)
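For point b), a minimal sketch of the transposition, using the TestName4 rows from the question as in-memory lists:

```python
# Attribute row followed by the three value rows of TestName4.
rows = [
    ["TestAttribute4-1", "TestAttribute4-2", "TestAttribute4-3"],
    ["TestAttributeValue4-1-1", "TestAttributeValue4-1-2", "TestAttributeValue4-1-3"],
    ["TestAttributeValue4-2-1", "TestAttributeValue4-2-2", "TestAttributeValue4-2-3"],
    ["TestAttributeValue4-3-1", "TestAttributeValue4-3-2", "TestAttributeValue4-3-3"],
]

# zip(*rows) pairs up the i-th element of every row, i.e. it turns columns into rows.
transposed = list(zip(*rows))

for record in transposed:
    print("\t".join(record))
```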

Question 9

@SteinarLima a) Module used now. In this case, though, complexity was not reduced. b) zip(*iterable) silently drops data in uneven columns. In my experience, few users desire data to disappear in that manner.

Question 10

b) izip_longest from itertools (named zip_longest in Python 3) can be used if you don't want that behavior.
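A short sketch of the difference, using Python 3 names and made-up rows of uneven length:

```python
from itertools import zip_longest  # called izip_longest in Python 2

# The second row is one element shorter than the first.
rows = [["a1", "a2", "a3"], ["v1", "v2"]]

# zip stops at the shortest row, so "a3" disappears.
truncated = list(zip(*rows))

# zip_longest pads the shorter rows with fillvalue instead.
padded = list(zip_longest(*rows, fillvalue=""))

print(truncated)
print(padded)
```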

Question 11

@SteinarLima Thanks! I forgot to check itertools. I may update the code above after work today.

Question 12

The csv module is superior to split(',') in many ways - the most important is that it handles quotation. The line 1,"me, you and him",2 should be split into 3 parts, not 4 for instance.
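The quoting point can be checked with the example line from this comment:

```python
import csv

line = '1,"me, you and him",2'

# Naive splitting breaks the quoted field at its embedded comma.
naive = line.split(",")

# csv.reader accepts any iterable of lines and respects the quotes.
parsed = next(csv.reader([line]))

print(len(naive), naive)
print(len(parsed), parsed)
```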

Steinar Lima · Accepted Answer · 2014-03-20 03:42:20Z
