3
\$\begingroup\$

I often need to parse tab-separated text (usually from a huge file) into records. I wrote a generator to do that for me; is there anything that could be improved in it, in terms of performance, extensibility or generality?

def table_parser(in_stream, types = None, sep = '\t', endl = '\n', comment = None):
    header = next(in_stream).rstrip(endl).split(sep)
    for lineno, line in enumerate(in_stream):
        if line == endl:
            continue # ignore blank lines
        if line[0] == comment:
            continue # ignore comments
        fields = line.rstrip(endl).split(sep)
        try:
            # could have done this outside the loop instead:
            # if types is None: types = {c : (lambda x : x) for c in headers}
            # but it nearly doubles the run-time if types actually is None
            if types is None:
                record = {col : fields[no] for no, col in enumerate(header)}
            else:
                record = {col : types[col](fields[no]) for no, col in enumerate(header)}
        except IndexError:
            print('Insufficient columns in line #{}:\n{}'.format(lineno, line))
            raise
        yield record
palacsint
asked Feb 2, 2012 at 9:40
\$\endgroup\$
1
  • \$\begingroup\$ @RikPoggi: I asked moderator to move it there. Thank you \$\endgroup\$ Commented Feb 2, 2012 at 9:56

2 Answers

1
\$\begingroup\$

One thing you could try, to reduce the amount of code in the loop, is to move this branch into a function that is chosen once, before the loop starts:

    if types is None:
        record = {col : fields[no] for no, col in enumerate(header)}
    else:
        record = {col : types[col](fields[no]) for no, col in enumerate(header)}

Something like this (not tested, but you should get the idea):

def table_parser(in_stream, types = None, sep = '\t', endl = '\n', comment = None):
    header = next(in_stream).rstrip(endl).split(sep)
    # No need to do this every time. Store it as a list: a bare enumerate()
    # is a one-shot iterator and would be exhausted after the first record.
    enumheader = list(enumerate(header))
    if types is None:
        def recorder(fields):
            return {col : fields[no] for no, col in enumheader}
    else:
        def recorder(fields):
            return {col : types[col](fields[no]) for no, col in enumheader}
    for lineno, line in enumerate(in_stream):
        if line == endl:
            continue # ignore blank lines
        if line[0] == comment:
            continue # ignore comments
        fields = line.rstrip(endl).split(sep)
        try:
            record = recorder(fields)
        except IndexError:
            print('Insufficient columns in line #{}:\n{}'.format(lineno, line))
            raise
        yield record

EDIT: from my first version (read comments)

Tiny thing: instead of

    if types is None:

I suggest

    if not types:
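
To make the difference between the two tests concrete, here is a minimal sketch. The `describe` helper is hypothetical and exists only for illustration; it is not part of the parser above. It reports what each test evaluates to for a given `types` argument.

```python
def describe(types=None):
    """Report what each of the two tests evaluates to for `types`."""
    return {
        "is None": types is None,   # true only when nothing (or None) was passed
        "not types": not types,     # true for None and for any empty container
    }

print(describe())            # no argument passed
print(describe({}))          # empty mapping passed explicitly
print(describe({"a": int}))  # non-empty mapping
```

The two tests only disagree for an empty container. With `if not types`, an explicitly passed empty mapping is treated like "no conversion requested"; with `if types is None`, the same empty mapping reaches the `types[col]` lookup and raises a `KeyError`. Which behavior is preferable depends on what callers are expected to pass.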
answered Feb 2, 2012 at 9:44
\$\endgroup\$
4
  • \$\begingroup\$ In many cases I think if types is None would be preferable. I think I see why you disagree in this case, but I'm curious... could you say more? \$\endgroup\$ Commented Feb 2, 2012 at 9:48
  • \$\begingroup\$ Generally, do not test for a specific type unless it's clearly required (duck typing). In this case I can't come up with a specific benefit; perhaps some other value that evaluates to false in a boolean context? Also, "is" tests identity, so it would be false even if some other None-like object compared equal to None. Completely academic, yes. \$\endgroup\$ Commented Feb 2, 2012 at 9:57
  • 2
    \$\begingroup\$ Well, PEP 8 says "Comparisons to singletons like None should always be done with 'is' or 'is not', never the equality operators." In this particular case, using is makes it possible to distinguish between cases where [] was passed and cases where no variable was passed. \$\endgroup\$ Commented Feb 2, 2012 at 10:03
  • \$\begingroup\$ All right that makes sense. From your link: A Foolish Consistency is the Hobgoblin of Little Minds \$\endgroup\$ Commented Feb 2, 2012 at 10:08
2
\$\begingroup\$

You could also use the csv module to iterate over your file. Its reader is implemented in C, which may make your code faster, and the loop becomes cleaner without line.rstrip(endl).split(sep).
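
A minimal sketch of what that could look like, assuming the same `types`, `sep` and `comment` parameters as in the question. The `endl` parameter is dropped here because `csv.reader` handles line endings itself, and `csv.QUOTE_NONE` is an assumption chosen so that quote characters are treated as ordinary text, as they are with a plain `split`. Whether this is actually faster has not been measured here.

```python
import csv
import io

def table_parser(in_stream, types=None, sep='\t', comment=None):
    """Yield one dict per data row, using the first row as the header."""
    reader = csv.reader(in_stream, delimiter=sep, quoting=csv.QUOTE_NONE)
    header = next(reader)
    for fields in reader:
        if not fields:
            continue  # csv.reader yields an empty list for blank lines
        if comment is not None and fields[0].startswith(comment):
            continue  # ignore comment lines
        if len(fields) < len(header):
            raise ValueError('Insufficient columns in line #{}: {!r}'.format(
                reader.line_num, fields))
        if types is None:
            yield dict(zip(header, fields))
        else:
            yield {col: types[col](val) for col, val in zip(header, fields)}

# Usage with an in-memory stream; for a real file, open it with newline=''
sample = io.StringIO("name\tage\n# a comment\n\nalice\t30\nbob\t25\n")
rows = list(table_parser(sample, types={'name': str, 'age': int}, comment='#'))
print(rows)
```

Unlike the original, this version checks the column count explicitly, because `zip` would otherwise silently truncate a short row instead of raising `IndexError`.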

answered Feb 2, 2012 at 19:06
\$\endgroup\$
