I have a file with the following data:
<<row>>12|xyz|abc|2.34<</row>>
<<eof>>
The file may have several rows like this. I am trying to design a parser that will parse each row in this file and return an array of all rows. What would be the best way of doing it? The code has to be written in Python. It should either skip rows that do not start with <<row>> or raise an error.
=======> UPDATE <========
I just found that a particular <<row>> can span multiple lines. So my code and the code present below aren't working anymore. Can someone please suggest an efficient solution?
The data files can contain hundreds to several thousands of rows.
3 Answers
A simple way without regular expressions:
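output = []
with open('input.txt', 'r') as f:
    for line in f:
        line = line.strip()  # drop the trailing newline before comparing
        if line == '<<eof>>':
            break
        elif not line.startswith('<<row>>'):
            continue
        else:
            # cut off '<<row>>' (7 characters) and '<</row>>' (8 characters)
            output.append(line[7:-8].split('|'))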
This collects every line starting with <<row>> and stops at the first line that contains only <<eof>>.
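The question also mentions raising an error instead of skipping lines that do not start with <<row>>. A minimal sketch of such a stricter variant, assuming every row sits on a single line. The function name parse_rows_strict, the choice of ValueError, and the decision to tolerate blank lines are my own assumptions, not something the question specifies:

```python
def parse_rows_strict(lines):
    """Parse single-line <<row>>...<</row>> records; raise on anything else."""
    start, end = '<<row>>', '<</row>>'
    output = []
    for number, raw in enumerate(lines, 1):
        line = raw.strip()
        if line == '<<eof>>':
            break  # end-of-file marker: stop reading
        if not line:
            continue  # blank lines are tolerated (an assumption)
        if not (line.startswith(start) and line.endswith(end)):
            raise ValueError(f'line {number}: not a complete row: {line!r}')
        # keep only the text between the two markers and split it into columns
        output.append(line[len(start):-len(end)].split('|'))
    return output
```

It takes any iterable of lines, so an open file object can be passed directly, e.g. parse_rows_strict(open('input.txt')).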
import re

def parseFile(fileName):
    with open(fileName) as f:
        def parseLine(line):
            # returns the four captured fields, or None if the line does not match
            m = re.match(r'<<row>>(\d+)\|(\w+)\|(\w+)\|([\d\.]+)<</row>>$', line)
            if m:
                return m.groups()
        return [ values for values in (
            parseLine(line)
            for line in f
            if line.startswith('<<row>>')) if values ]
And? Am I different? ;-)
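If a row can span several lines, one way to adapt the regular-expression approach is to match against the whole file content instead of going line by line. This is only a sketch under a few assumptions: the file fits in memory (a few thousand rows should), a line break inside a row carries no meaning and can be dropped, and text outside of <<row>>...<</row>> is silently ignored. The names ROW_RE and parse_text are made up for illustration:

```python
import re

# DOTALL lets '.' match newlines, so one row may span several lines;
# the non-greedy '.*?' stops at the first closing marker.
ROW_RE = re.compile(r'<<row>>(.*?)<</row>>', re.DOTALL)

def parse_text(text):
    # ignore everything after the first <<eof>> marker
    body = text.split('<<eof>>', 1)[0]
    rows = []
    for m in ROW_RE.finditer(body):
        # drop the line breaks inside the row, then split into columns
        content = m.group(1).replace('\r', '').replace('\n', '')
        rows.append(content.split('|'))
    return rows
```

With a file this would be called as parse_text(open(fileName).read()). Unlike the stricter pattern above, it does not validate the type of each column.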
5 Comments
Using split etc. is always kind of a special version for a special case. If a slightly modified version of the format pops up in the future, adjusting the regexp is a cinch, while a new version built on simpler parsing mechanisms quickly becomes unable to cope with the task.
[...] yield). Neither TobiMarg's nor my solution then is appropriate.
A good practice is to test for unwanted cases and ignore them. Once you are sure that you have a compliant line, you process it. Note that the actual processing is not in an if statement. Without rows split across several lines, you need only two tests:
rows = list()
with open('newfile.txt') as file:
    for line in file:
        line = line.strip()
        if not line.startswith('<<row>>'):
            continue
        if not line[-8:] == '<</row>>':
            continue
        row = line[7:-8]
        rows.append(row)
With rows split across several lines, you need to save the previous line in some situations:
rows = list()
prev = None
with open('newfile.txt') as file:
    for line in file:
        line = line.strip()
        if not line.startswith('<<row>>') and prev is not None:
            line = prev + line
        if not line.startswith('<<row>>'):
            continue
        if not line[-8:] == '<</row>>':
            prev = line
            continue
        row = line[7:-8]
        rows.append(row)
        prev = None
If needed, you can split columns with: cols = row.split('|')
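For larger files, the same line-joining idea could also be written as a generator, so that rows are produced one at a time instead of being collected in a list first. A rough sketch, not a drop-in replacement: the name iter_rows is hypothetical, continuation lines are joined without any separator, <<eof>> is only checked between rows, and a new <<row>> that starts before the previous one was closed is simply appended to it rather than starting over:

```python
def iter_rows(lines):
    """Yield each row as a list of columns; a row may span several lines."""
    start, end = '<<row>>', '<</row>>'
    pending = None  # text of a row that has been opened but not closed yet
    for raw in lines:
        line = raw.strip()
        if pending is None:
            if line == '<<eof>>':
                return  # end-of-file marker: stop iterating
            if not line.startswith(start):
                continue  # ignore anything outside of a row
            pending = line
        else:
            pending += line  # continuation of an unfinished row
        if pending.endswith(end):
            yield pending[len(start):-len(end)].split('|')
            pending = None
```

An open file can be passed directly, e.g. for cols in iter_rows(open('newfile.txt')), and list(iter_rows(...)) gives the full array when that is needed.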
<<row>> can span multiple lines. So my code isn't working anymore, and neither are the ones answered below. Can you please help? Should I re-post this as a new question, or edit the question?