I often need to parse tab-separated text (usually from a huge file) into records. I wrote a generator to do that for me; is there anything that could be improved in it, in terms of performance, extensibility or generality?
def table_parser(in_stream, types = None, sep = '\t', endl = '\n', comment = None):
    header = next(in_stream).rstrip(endl).split(sep)
    for lineno, line in enumerate(in_stream):
        if line == endl:
            continue # ignore blank lines
        if line[0] == comment:
            continue # ignore comments
        fields = line.rstrip(endl).split(sep)
        try:
            # could have done this outside the loop instead:
            # if types is None: types = {c : (lambda x : x) for c in headers}
            # but it nearly doubles the run-time if types actually is None
            if types is None:
                record = {col : fields[no] for no, col in enumerate(header)}
            else:
                record = {col : types[col](fields[no]) for no, col in enumerate(header)}
        except IndexError:
            print('Insufficient columns in line #{}:\n{}'.format(lineno, line))
            raise
        yield record
@RikPoggi: I asked a moderator to move it there. Thank you. – max, Feb 2, 2012 at 9:56
2 Answers
One thing you could try, to reduce the amount of code in the loop, is to choose the record-building function once, before the loop, instead of re-evaluating this branch for every line:
if types is None:
    record = {col : fields[no] for no, col in enumerate(header)}
else:
    record = {col : types[col](fields[no]) for no, col in enumerate(header)}
Something like this (not tested, but you should get the idea):
def table_parser(in_stream, types = None, sep = '\t', endl = '\n', comment = None):
    header = next(in_stream).rstrip(endl).split(sep)
    # No need to do this every time. It has to be a list, though: a bare
    # enumerate() object is an iterator and would be exhausted after one line.
    enumheader = list(enumerate(header))
    if types is None:
        def recorder(fields):
            return {col : fields[no] for no, col in enumheader}
    else:
        def recorder(fields):
            return {col : types[col](fields[no]) for no, col in enumheader}
    for lineno, line in enumerate(in_stream):
        if line == endl:
            continue # ignore blank lines
        if line[0] == comment:
            continue # ignore comments
        fields = line.rstrip(endl).split(sep)
        try:
            # only the fields are passed in; the columns come from enumheader
            record = recorder(fields)
        except IndexError:
            print('Insufficient columns in line #{}:\n{}'.format(lineno, line))
            raise
        yield record
EDIT: the following is left over from my first version (see the comments).
Tiny thing:
if types is None:
I suggest
if not types:
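A quick sketch of how the two checks differ, for anyone weighing them. The helper name `describe` is made up for illustration and is not part of the parser. The checks agree except when `types` is falsy but not None, such as an empty dict:

```python
def describe(types=None):
    """Hypothetical helper: report how each check classifies the argument."""
    return {
        'is_none': types is None,  # True only when nothing (or None) was passed
        'falsy': not types,        # also True for {}, [], '' and 0
    }


no_argument = describe()            # both checks are True
empty_mapping = describe({})        # 'is_none' is False, 'falsy' is True
real_mapping = describe({'a': int}) # both checks are False
```

With `if not types:`, an empty mapping silently takes the "no conversion" path; with `if types is None:`, it takes the conversion path and fails on the first column lookup. Which of these is preferable depends on how the caller is expected to use the parameter.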
In many cases I think `if types is None` would be preferable. I think I see why you disagree in this case, but I'm curious... could you say more? – senderle, Feb 2, 2012 at 9:48
Generally, don't test for a specific type unless it's clearly required (duck typing). In this case I can't come up with a specific benefit; perhaps some other value that evaluates to false in a boolean context? Also, `is` tests identity, so it would be false even if some other None-like object compared equal to None. Completely academic, yes. – Johan Lundberg, Feb 2, 2012 at 9:57
Well, PEP 8 says "Comparisons to singletons like None should always be done with 'is' or 'is not', never the equality operators." In this particular case, using `is` makes it possible to distinguish between cases where `[]` was passed and cases where no argument was passed. – senderle, Feb 2, 2012 at 10:03
All right, that makes sense. From your link: "A Foolish Consistency is the Hobgoblin of Little Minds". – Johan Lundberg, Feb 2, 2012 at 10:08
You could also use the csv module to iterate over your file. Your code would likely be faster, because the module is implemented in C, and cleaner, because it would no longer need line.rstrip(endl).split(sep).
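A minimal sketch of what that could look like, assuming the same record format as in the question. The function name `table_parser_csv` is invented here, and two behaviours are assumptions rather than a drop-in match: short lines raise ValueError instead of IndexError, and `quoting=csv.QUOTE_NONE` is used so that quote characters are treated as ordinary text, as they are with a plain `split`:

```python
import csv
import io


def table_parser_csv(in_stream, types=None, sep='\t', comment=None):
    """Hypothetical csv-based variant of table_parser; yields one dict per row."""
    reader = csv.reader(in_stream, delimiter=sep, quoting=csv.QUOTE_NONE)
    header = next(reader)
    for fields in reader:
        if not fields:
            continue  # csv.reader yields an empty list for blank lines
        if comment is not None and fields[0].startswith(comment):
            continue  # ignore comment lines
        if len(fields) < len(header):
            raise ValueError(
                'Insufficient columns in line #{}'.format(reader.line_num))
        if types is None:
            yield dict(zip(header, fields))
        else:
            yield {col: types[col](val) for col, val in zip(header, fields)}


# Usage with an in-memory stream standing in for a real file
sample = io.StringIO("name\tage\n# a comment\n\nalice\t30\nbob\t25\n")
records = list(table_parser_csv(sample, types={'name': str, 'age': int},
                                comment='#'))
```

When reading from a real file, the csv documentation recommends opening it with `newline=''`. If no type conversion is needed, `csv.DictReader` with `delimiter='\t'` builds the dictionaries directly, though it has no notion of comment lines. Whether either variant is actually faster on a given file is worth measuring rather than assuming.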