Text parser implemented as a generator

Question 1

I often need to parse tab-separated text (usually from a huge file) into records. I wrote a generator to do that for me; is there anything that could be improved in it, in terms of performance, extensibility or generality?

def table_parser(in_stream, types = None, sep = '\t', endl = '\n', comment = None):
    header = next(in_stream).rstrip(endl).split(sep)
    for lineno, line in enumerate(in_stream):
        if line == endl:
            continue # ignore blank lines
        if line[0] == comment:
            continue # ignore comments
        fields = line.rstrip(endl).split(sep)
        try:
            # could have done this outside the loop instead:
            # if types is None: types = {c : (lambda x : x) for c in header}
            # but it nearly doubles the run-time if types actually is None
            if types is None:
                record = {col : fields[no] for no, col in enumerate(header)}
            else:
                record = {col : types[col](fields[no]) for no, col in enumerate(header)}
        except IndexError:
            print('Insufficient columns in line #{}:\n{}'.format(lineno, line))
            raise
        yield record

Question 2

@RikPoggi: I asked a moderator to move it there. Thank you.

Question 3

One thing you could try, to reduce the amount of code in the loop, is to pick a function once, before the loop starts, that replaces this branch:

    if types is None:
        record = {col : fields[no] for no, col in enumerate(header)}
    else:
        record = {col : types[col](fields[no]) for no, col in enumerate(header)}

Something like this (not tested, but you should get the idea):

def table_parser(in_stream, types = None, sep = '\t', endl = '\n', comment = None):
    header = next(in_stream).rstrip(endl).split(sep)
    # No need to do this every time. list() is needed so the pairs can be
    # reused: a bare enumerate() is exhausted after the first pass.
    enumheader = list(enumerate(header))
    if types is None:
        def recorder(fields):
            return {col : fields[no] for no, col in enumheader}
    else:
        def recorder(fields):
            return {col : types[col](fields[no]) for no, col in enumheader}
    for lineno, line in enumerate(in_stream):
        if line == endl:
            continue # ignore blank lines
        if line[0] == comment:
            continue # ignore comments
        fields = line.rstrip(endl).split(sep)
        try:
            record = recorder(fields)
        except IndexError:
            print('Insufficient columns in line #{}:\n{}'.format(lineno, line))
            raise
        yield record
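
One caveat when hoisting enumerate(header) out of the loop: enumerate() returns a one-shot iterator, so a stored enumerate object yields nothing on its second pass. A minimal sketch of that behaviour and of the list() workaround (the header values here are made up for illustration):

```python
header = ['name', 'age']

# enumerate() returns a one-shot iterator: the second pass sees nothing.
enumheader = enumerate(header)
first = {col: no for no, col in enumheader}
second = {col: no for no, col in enumheader}

# Materialising the pairs as a list makes them reusable for every line.
pairs = list(enumerate(header))
third = {col: no for no, col in pairs}
fourth = {col: no for no, col in pairs}

print(first, second)
print(third, fourth)
```

With the bare iterator, every record after the first would come out empty, which is why the pairs need to be stored as a list (or tuple) if they are built only once.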

EDIT: from my first version (see the comments)

Tiny thing: instead of
    if types is None:
I suggest
    if not types:

Question 4

In many cases I think "if types is None" would be preferable. I think I see why you disagree in this case, but I'm curious... could you say more?

Question 5

Generally, don't test for a specific type unless it's clearly required (duck typing). In this case I can't come up with a specific benefit; perhaps some other type that evaluates to false in a boolean context? Also, 'is' tests identity, so it would be false even for some other None-like object that compares equal to None. Completely academic, yes.

Question 6

Well, PEP 8 says "Comparisons to singletons like None should always be done with 'is' or 'is not', never the equality operators." In this particular case, using 'is' makes it possible to distinguish between cases where an empty container such as [] was passed and cases where no argument was passed.
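
To make the difference concrete, here is a small sketch. The describe function is a made-up helper for illustration, not part of the parser; it just reports which way each of the two tests would go for a given argument:

```python
def describe(types=None):
    """Report the outcome of both tests (hypothetical helper)."""
    return {
        'is None': types is None,   # true only when nothing (or None) was passed
        'not types': not types,     # true for None and for any empty container
    }

omitted = describe()             # caller passed nothing
empty = describe({})             # caller passed an empty mapping
given = describe({'age': int})   # caller passed a real mapping

print(omitted)
print(empty)
print(given)
```

Only the empty-container case differs: "not types" treats it like an omitted argument, while "is None" does not. Which of the two is wanted depends on whether an empty types mapping should mean "no conversion" or should be treated as a caller error.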

Question 7

All right, that makes sense. From your link: "A Foolish Consistency is the Hobgoblin of Little Minds".

Question 8

You could also use the csv module to iterate over your file. Your code would likely be faster, because the csv reader is implemented in C, and cleaner without line.rstrip(endl).split(sep).
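
A rough sketch of what that might look like, under some assumptions: table_parser_csv is a made-up name, quoting is switched off so that quote characters stay literal as they do with str.split, and the endl parameter is dropped because csv.reader handles line endings itself. Whether it is actually faster on your files would need measuring.

```python
import csv
import io

def table_parser_csv(in_stream, types=None, sep='\t', comment=None):
    """Sketch of the same generator on top of csv.reader (hypothetical name)."""
    # QUOTE_NONE keeps quote characters literal, like str.split does.
    reader = csv.reader(in_stream, delimiter=sep, quoting=csv.QUOTE_NONE)
    header = next(reader)
    for fields in reader:
        if not fields:
            continue  # csv.reader yields [] for blank lines
        if comment is not None and fields[0].startswith(comment):
            continue  # ignore comment lines
        if len(fields) < len(header):
            raise IndexError('Insufficient columns in line #{}: {}'.format(
                reader.line_num, fields))
        if types is None:
            yield dict(zip(header, fields))
        else:
            yield {col: types[col](val) for col, val in zip(header, fields)}

# Usage with an in-memory stream standing in for a real file.
data = io.StringIO('name\tage\n# a comment\n\nalice\t30\nbob\t25\n')
records = list(table_parser_csv(data, types={'name': str, 'age': int}, comment='#'))
print(records)
```

Note that reader.line_num counts physical lines including the header, so the number in the error message points at the actual line in the file.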

Johan Lundberg · Accepted Answer (the reply under Question 3, above) · 2012-02-02 09:44:19Z

