Complicated parsing in python

Question 1

I have a weird parsing problem with python. I need to parse the following text.

Here I need only the section between (not including) the "pre" tag and the column of numbers (starting with 205 4 164). I have several pages in this format.

<html>
<pre>
A Short Study of Notation Efficiency
CACM August, 1960
Smith Jr., H. J.
CA600802 JB March 20, 1978 9:02 PM
205 4 164
210 4 164
214 4 164
642 4 164
1 5 164
</pre>
</html>

Question 2

What parts are you trying to parse? What result format are you seeking?

Question 3

I just want this part: A Short Study of Notation Efficiency CACM August, 1960 Smith Jr., H. J. CA600802 JB March 20, 1978 9:02 PM

Question 4

The part between <pre> and column of numbers. I am good with a string. From there I can work. Thanks.

Question 5

Quazi, this calls out for a regex, specifically <pre>(.+?)(?:\d+\s+){3} with the DOTALL flag enabled.

You can find out about how to use regex in Python at http://docs.python.org/library/re.html and if you do a lot of this sort of string extraction, you'll be very glad you did. Going over my provided regex piece-by-piece:

<pre> just directly matches the pre tag
(.+?) lazily matches and captures any characters, as few as possible (newlines included, since DOTALL is enabled)
(?:\d+\s+){3} matches some digits followed by some whitespace, three times in a row, without capturing
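To make the pieces above concrete, here is a minimal sketch that applies the pattern to the sample page from the question. The names `page`, `HEADER_RE` and `extract_header` are made up for illustration; only the pattern itself and the DOTALL flag come from the answer.

```python
import re

# Sample page copied from the question.
page = """<html>
<pre>
A Short Study of Notation Efficiency
CACM August, 1960
Smith Jr., H. J.
CA600802 JB March 20, 1978 9:02 PM
205 4 164
210 4 164
214 4 164
642 4 164
1 5 164
</pre>
</html>"""

# DOTALL lets "." match newlines, so the lazy group can span several lines.
HEADER_RE = re.compile(r"<pre>(.+?)(?:\d+\s+){3}", re.DOTALL)


def extract_header(text):
    """Return the text between <pre> and the first run of three numbers, or None."""
    match = HEADER_RE.search(text)
    if match is None:
        return None
    # group(1) is the captured part; strip the newlines around it.
    return match.group(1).strip()


header = extract_header(page)
print(header)
```

One caveat: the lazy group stops at the first place where three whitespace-separated numbers appear in a row, so a header line that itself contains such a run would cut the capture short.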

Question 6

@minitech, thanks for the correction! I hadn't noticed that SO had gobbled my pre tag.

Question 7

Here's a regular expression to do that:

import re

# Raw string, so \d and \s reach the regex engine unchanged
findData = re.compile(r'(?<=<pre>).+?(?=[\d\s]*</pre>)', re.S)
# ...
result = findData.search(data).group(0).strip()

Here's a demo.
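For a self-contained check, the snippet can be run against the sample page from the question. This is only a sketch: `data` is assumed to hold the page exactly as shown in the question, and the pattern is written as a raw string.

```python
import re

# Assumed input: the sample page exactly as shown in the question.
data = """<html>
<pre>
A Short Study of Notation Efficiency
CACM August, 1960
Smith Jr., H. J.
CA600802 JB March 20, 1978 9:02 PM
205 4 164
210 4 164
214 4 164
642 4 164
1 5 164
</pre>
</html>"""

# Lookbehind: the match must start right after <pre>.
# Lookahead: the match ends where only digits and whitespace remain before </pre>.
findData = re.compile(r'(?<=<pre>).+?(?=[\d\s]*</pre>)', re.S)

result = findData.search(data).group(0).strip()
print(result)
```

Because both ends are lookarounds, neither the tag nor the trailing block of digits and whitespace is part of group(0) on this sample.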

Question 8

Not exactly what the OP wants - based on the comments to the OP he only wants the four text lines before the columns of numbers.

Question 9

@Li-aungYip: And that's exactly what this code does. Just not as a list, like yours. Is that the problem?

Question 10

Your regex group(0) includes the columns of numbers. See the output.

Question 11

I'd probably use lxml or BeautifulSoup. IMO, regexes are heavily overused, especially for parsing HTML.
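As a rough illustration of the parser-based route, here is a sketch that uses the standard library's `html.parser` as a stand-in for lxml or BeautifulSoup, so it runs without third-party packages. The class `PreTextExtractor` and the function `header_lines` are hypothetical names, and the rule "stop at the first line made up only of numbers" is an assumption about the page format.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collect the text that appears inside <pre> elements."""

    def __init__(self):
        super().__init__()
        self.in_pre = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre":
            self.in_pre = False

    def handle_data(self, data):
        if self.in_pre:
            self.chunks.append(data)


def header_lines(html):
    """Return the non-blank <pre> lines that precede the first all-numeric line."""
    parser = PreTextExtractor()
    parser.feed(html)
    parser.close()
    lines = []
    for line in "".join(parser.chunks).splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if all(field.isdigit() for field in fields):
            break  # assumed start of the number columns
        lines.append(line.strip())
    return lines


page = """<html>
<pre>
A Short Study of Notation Efficiency
CACM August, 1960
Smith Jr., H. J.
CA600802 JB March 20, 1978 9:02 PM
205 4 164
210 4 164
</pre>
</html>"""

print(header_lines(page))
```

The parser handles the markup, and the line filter makes no assumption about how many header lines there are or how they are spaced.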

Question 12

Other people have offered up regex solutions, which are good but may behave unexpectedly at times.

If the pages are exactly as shown in your example, that is:

No other HTML tags are present - only the <html> and <pre> tags
The number of lines is always consistent
The spacing between lines is always consistent

Then a simple approach like this will do:

my_text = """<html>
<pre>
A Short Study of Notation Efficiency
CACM August, 1960
Smith Jr., H. J.
CA600802 JB March 20, 1978 9:02 PM
205 4 164
210 4 164
214 4 164
642 4 164
1 5 164
</pre>
</html>"""
lines = my_text.split("\n")
# lines[0] is "<html>" and lines[1] is "<pre>", so the header starts at index 2
title = lines[2]
journal = lines[3]
author = lines[4]
date = lines[5]

If you can't guarantee the spacing between lines, but you can guarantee that you only want the first four non-empty lines inside the <html><pre>:

import pprint
max_extracted_lines = 4
extracted_lines = []
for line in lines:
    if line == "<html>" or line == "<pre>":
        continue
    if line:
        extracted_lines.append(line)
    if len(extracted_lines) >= max_extracted_lines:
        break
pprint.pprint(extracted_lines)

Giving output:

['A Short Study of Notation Efficiency',
 'CACM August, 1960',
 'Smith Jr., H. J.',
 'CA600802 JB March 20, 1978 9:02 PM']

Don't use regex where simple string operations will do.

Question 13

I couldn't disagree more; regexes do not "behave unexpectedly at times", they follow very straightforward rules. On the other hand, making unnecessary assumptions about minor details of the format of the input data is likely to backfire.

Question 14

Thanks for the alternate approach. Unfortunately, there really is no way to ensure that all the factors will be consistent. But other people's regexes have worked well. Thank you for your time.

Question 15

@QuaziFarhan: no worries. As with all things, you should use the simplest approach that works - but no simpler. This approach is evidently a little too simplistic. ;)

DSimon · Accepted Answer · 2012-04-09 23:21:28Z
