Parsing a string pattern (Python)

Question 1

I have a file with the following data:

<<row>>12|xyz|abc|2.34<</row>>
<<eof>>

The file may have several rows like this. I am trying to design a parser that reads each row in the file and returns an array of all rows. What would be the best way to do it? The code has to be written in Python. It should either skip lines that do not start with <<row>> or raise an error.

=======> UPDATE <========

I just found out that a single <<row>> can span multiple lines, so my code and the answers below no longer work. Can someone please suggest an efficient solution?

The data files can contain hundreds to several thousands of rows.

Question 2

Looks like a pretty straightforward task. Where are you having problems?

Question 3

It is a simple task, I know, but I want to see how a different programmer would solve it.

Question 4

Post the solution you already have. You will get advice on how to improve it.

Question 5

While working with the code, I found that rows in the data files are not restricted to one line, so a single <<row>> can span multiple lines. My code no longer works, and neither do the answers below. Can you please help? Should I re-post this as a new question, or edit this one?

Question 6

A simple way without regular expressions:

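output = []
with open('input.txt', 'r') as f:
    for line in f:
        line = line.strip()  # drop the trailing newline before comparing
        if line == '<<eof>>':
            break
        elif not line.startswith('<<row>>'):
            continue
        else:
            # cut off '<<row>>' (7 chars) and '<</row>>' (8 chars)
            output.append(line[7:-8].split('|'))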

This uses every line starting with <<row>> until a line contains only <<eof>>.

Question 7

import re

def parseFile(fileName):
    with open(fileName) as f:
        def parseLine(line):
            # returns a tuple of the four fields, or None if the line is malformed
            m = re.match(r'<<row>>(\d+)\|(\w+)\|(\w+)\|([\d\.]+)<</row>>$', line)
            if m:
                return m.groups()
        return [values for values in (
                    parseLine(line)
                    for line in f
                    if line.startswith('<<row>>')) if values]

And? Am I different? ;-)

Question 8

I guess. But doing it without regular expressions is better, I believe.

Question 9

Why do you believe that? A regexp is more general. Using split and the like always amounts to a special solution for a special case. If a slightly modified version of the format shows up in the future, adjusting the regexp is a cinch, whereas a parser built on simpler string operations may quickly become unable to cope.
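To illustrate the point about adjusting the pattern, here is a minimal sketch for the multi-line case from the update. It assumes the whole file fits in memory and that line breaks inside a row are just wrapping with no meaning. The names ROW_RE and parse_rows are made up for this example.

```python
import re

# Non-greedy match between the markers; DOTALL lets '.' cross line breaks.
ROW_RE = re.compile(r'<<row>>(.*?)<</row>>', re.DOTALL)

def parse_rows(text):
    """Return one list of fields per <<row>>...<</row>> block in text."""
    rows = []
    for m in ROW_RE.finditer(text):
        # Assumption: newlines inside a row are only wrapping, so drop them.
        body = m.group(1).replace('\r', '').replace('\n', '')
        rows.append(body.split('|'))
    return rows

sample = "<<row>>12|xyz|abc|2.34<</row>>\n<<row>>13|x\nyz|def|5.6<</row>>\n<<eof>>\n"
print(parse_rows(sample))
```

Note that this version silently ignores anything outside a row, including <<eof>>, so it skips bad input rather than raising an error.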

Question 10

The string methods are faster, so it makes more sense for me to avoid regex, since these files are going to contain thousands of rows. The files contain data that we buy from a data provider, so I have no say in the input format.
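The speed claim can be measured rather than assumed. Below is a rough sketch using the standard timeit module on a single sample line. The function names are made up for this example, and the timings depend on the machine and the Python version, so no numbers are quoted here.

```python
import re
import timeit

LINE = '<<row>>12|xyz|abc|2.34<</row>>'
ROW_RE = re.compile(r'<<row>>(.*)<</row>>$')

def with_slicing(line):
    # Relies on the fixed marker lengths: 7 and 8 characters.
    return line[7:-8].split('|')

def with_regex(line):
    return ROW_RE.match(line).group(1).split('|')

# Both variants must agree before the timings mean anything.
assert with_slicing(LINE) == with_regex(LINE)

n = 100_000
t_slice = timeit.timeit(lambda: with_slicing(LINE), number=n)
t_regex = timeit.timeit(lambda: with_regex(LINE), number=n)
print(f'slicing: {t_slice:.3f}s  regex: {t_regex:.3f}s  for {n} lines')
```

For files with a few thousand rows, either approach will probably finish quickly enough that readability matters more than the difference.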

Question 11

And to handle future format changes, I'm putting the whole parser into a class with constants that store the beginning and ending sequences, so later I can just change the values. That is why I was looking for a different solution: I thought I might be building a whole class unnecessarily when the task is so simple.
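A class like the one described could look roughly like this. It is only a sketch of the idea, not the asker's actual code: RowParser, its constants and its method names are all hypothetical, and it handles single-line rows only.

```python
class RowParser:
    # Markers kept as class constants so a format change is a one-line edit.
    ROW_START = '<<row>>'
    ROW_END = '<</row>>'
    EOF = '<<eof>>'
    SEPARATOR = '|'

    def parse_line(self, line):
        """Return the fields of one single-line row; raise ValueError otherwise."""
        line = line.strip()
        if not (line.startswith(self.ROW_START) and line.endswith(self.ROW_END)):
            raise ValueError('not a row: %r' % line)
        body = line[len(self.ROW_START):-len(self.ROW_END)]
        return body.split(self.SEPARATOR)

    def parse(self, lines):
        """Parse an iterable of lines up to the EOF marker."""
        rows = []
        for line in lines:
            if line.strip() == self.EOF:
                break
            if line.strip():  # skip blank lines
                rows.append(self.parse_line(line))
        return rows

parser = RowParser()
print(parser.parse(['<<row>>12|xyz|abc|2.34<</row>>\n', '<<eof>>\n']))
```

A subclass can override the constants if the provider ever changes the markers, which keeps the class small enough to justify itself.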

Question 12

Ah, »The file may have several rows like this« does not really sound like thousands ;-) In that case I'd propose using a generator to produce the output (using yield). Neither TobiMarg's solution nor mine is appropriate then.
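A generator version might look like the sketch below. It assumes single-line rows and skips non-row lines; iter_rows is a made-up name. Because it yields one row at a time, the caller never has to hold the whole result in memory.

```python
def iter_rows(lines):
    """Yield the fields of each single-line row, stopping at <<eof>>."""
    for line in lines:
        line = line.strip()
        if line == '<<eof>>':
            return
        if line.startswith('<<row>>') and line.endswith('<</row>>'):
            yield line[7:-8].split('|')

# A list stands in for an open file here; any iterable of lines works.
sample = ['<<row>>12|xyz|abc|2.34<</row>>\n', 'junk\n', '<<eof>>\n',
          '<<row>>ignored<</row>>\n']
for fields in iter_rows(sample):
    print(fields)
```

With a real file, the loop would sit inside a with open(...) block and iterate over the file object directly.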

Question 13

A good practice is to test for unwanted cases first and skip them. Once you are sure you have a compliant line, you process it. Note that the actual processing is not nested inside an if statement. As long as rows are not split across several lines, you need only two tests:

rows = list()
with open('newfile.txt') as file:
    for line in file:
        line = line.strip()
        # test 1: the line must open a row
        if not line.startswith('<<row>>'):
            continue
        # test 2: the row must be closed on the same line
        if not line.endswith('<</row>>'):
            continue
        row = line[7:-8]
        rows.append(row)

With rows split across several lines, you need to carry the unfinished part of a row over to the next line:

rows = list()
prev = None  # text of a row that was opened but not yet closed
with open('newfile.txt') as file:
    for line in file:
        line = line.strip()
        # continuation line: glue it onto the unfinished row
        if not line.startswith('<<row>>') and prev is not None:
            line = prev + line
        if not line.startswith('<<row>>'):
            continue
        # row not closed yet: remember it and read on
        if not line.endswith('<</row>>'):
            prev = line
            continue
        row = line[7:-8]
        rows.append(row)
        prev = None

If needed, you can split columns with: cols = row.split('|')
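The question also allows raising an error instead of skipping bad lines. Here is a sketch of a strict variant of the multi-line loop. parse_strict is a made-up name, and the sketch assumes that a missing <<eof>> is acceptable and that blank lines between rows can be ignored.

```python
def parse_strict(lines):
    """Collect rows that may span lines; raise ValueError on stray content."""
    rows = []
    pending = None  # text of a row that has been opened but not yet closed
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if pending is None:
            if not line:
                continue  # blank line between rows
            if line == '<<eof>>':
                return rows
            if not line.startswith('<<row>>'):
                raise ValueError('line %d: expected <<row>>, got %r' % (number, line))
            pending = line
        else:
            pending += line  # continuation of an open row
        if pending.endswith('<</row>>'):
            rows.append(pending[7:-8].split('|'))
            pending = None
    if pending is not None:
        raise ValueError('unterminated row: %r' % pending)
    return rows

sample = ['<<row>>12|xyz|\n', 'abc|2.34<</row>>\n', '<<eof>>\n']
print(parse_strict(sample))
```

The caller decides what to do with the ValueError, for example logging the bad file and moving on to the next one.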

Accepted answer (the approach without regular expressions): TobiMarg, 2013-05-27 19:18:39Z

