I have a weird parsing problem with Python. I need to parse the following text.
I need only the section between the "pre" tag (not including the tag itself) and the column of numbers (starting with 205 4 164). I have several pages in this format.
<html>
<pre>
A Short Study of Notation Efficiency
CACM August, 1960
Smith Jr., H. J.
CA600802 JB March 20, 1978 9:02 PM
205 4 164
210 4 164
214 4 164
642 4 164
1 5 164
</pre>
</html>
- What parts are you trying to parse? What result format are you seeking? (sblom, Apr 9, 2012 at 23:04)
- I just want this part: A Short Study of Notation Efficiency CACM August, 1960 Smith Jr., H. J. CA600802 JB March 20, 1978 9:02 PM (Quazi Farhan, Apr 9, 2012 at 23:07)
- The part between <pre> and the column of numbers. I am good with a string. From there I can work. Thanks. (Quazi Farhan, Apr 9, 2012 at 23:08)
4 Answers
Quazi, this calls out for a regex, specifically <pre>(.+?)(?:\d+\s+){3} with the DOTALL flag enabled.
You can find out about how to use regex in Python at http://docs.python.org/library/re.html and if you do a lot of this sort of string extraction, you'll be very glad you did. Going over my provided regex piece-by-piece:
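<pre> matches the opening pre tag literally
(.+?) lazily matches and captures any characters (newlines included, because of DOTALL), stopping as soon as the rest of the pattern can match
(?:\d+\s+){3} matches a run of digits followed by whitespace, three times in a row, i.e. the first row of the number column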
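To make the piece-by-piece description concrete, here is a minimal sketch that applies the pattern to the sample page from the question. The names `HEADER_RE`, `sample_page`, and `extract_header` are made up for illustration. It assumes the header never contains three whitespace-separated numbers in a row; if it did, the lazy group would stop early.

```python
import re

# Pattern from the answer above; DOTALL lets "." cross line breaks.
HEADER_RE = re.compile(r"<pre>(.+?)(?:\d+\s+){3}", re.DOTALL)

# Sample page copied from the question.
sample_page = """<html>
<pre>
A Short Study of Notation Efficiency
CACM August, 1960
Smith Jr., H. J.
CA600802 JB March 20, 1978 9:02 PM
205 4 164
210 4 164
214 4 164
642 4 164
1 5 164
</pre>
</html>"""


def extract_header(page):
    """Return the text between <pre> and the first row of three numbers, or None."""
    match = HEADER_RE.search(page)
    if match is None:
        return None
    # group(1) is the lazily captured header; strip the surrounding newlines.
    return match.group(1).strip()


header = extract_header(sample_page)
print(header)
```

Returning `None` on a failed match, rather than calling `.group()` directly, avoids an `AttributeError` on pages that do not follow the expected layout.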
Here's a regular expression to do that:
import re
findData = re.compile(r'(?<=<pre>).+?(?=[\d\s]*</pre>)', re.S)
# ...
# data holds the page text; search() returns None if nothing matches
result = findData.search(data).group(0).strip()
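Whether the match stops exactly at the number column depends on the input, so it may be worth checking the pattern against small cases. The sketch below uses two hypothetical miniature pages (not taken from the real data) and a hypothetical helper `header_text`. The second page shows one edge case: when the last header line ends in digits, the lookahead treats those digits as part of the number column.

```python
import re

# Same pattern as above, written as a raw string.
find_data = re.compile(r"(?<=<pre>).+?(?=[\d\s]*</pre>)", re.S)


def header_text(page):
    """Return the stripped match, or None when the pattern does not match."""
    match = find_data.search(page)
    return match.group(0).strip() if match else None


# Hypothetical miniature pages for illustration only.
page_ok = "<pre>\nSome Title\nMarch 20, 1978 9:02 PM\n205 4 164\n1 5 164\n</pre>"
page_digits = "<pre>\nSome Title\nMarch 20, 1978\n205 4 164\n1 5 164\n</pre>"

print(repr(header_text(page_ok)))
print(repr(header_text(page_digits)))
```

On the first page the header comes back intact. On the second, the trailing "1978" is lost, because everything from that point to `</pre>` consists only of digits and whitespace.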
- group(0) includes the columns of numbers. See the output.
I'd probably use lxml or BeautifulSoup. IMO, regexes are heavily overused, especially for parsing HTML.
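As a rough illustration of the parser-based route, here is a sketch that swaps in the standard library's `html.parser` instead of lxml or BeautifulSoup, so it runs without extra installs. The class `PreTextExtractor`, the function `extract_header_lines`, and the rule "stop at the first line made only of numbers" are assumptions for this example, not part of the answer above.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collect the text that appears inside <pre> elements."""

    def __init__(self):
        super().__init__()
        self.in_pre = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre":
            self.in_pre = False

    def handle_data(self, data):
        if self.in_pre:
            self.chunks.append(data)


def extract_header_lines(page):
    """Return the non-blank <pre> lines that precede the first all-numeric line."""
    parser = PreTextExtractor()
    parser.feed(page)
    parser.close()
    header = []
    for line in "".join(parser.chunks).splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # Assumed rule: a line whose tokens are all digits starts the number column.
        if all(token.isdigit() for token in stripped.split()):
            break
        header.append(stripped)
    return header


sample_page = """<html>
<pre>
A Short Study of Notation Efficiency
CACM August, 1960
Smith Jr., H. J.
CA600802 JB March 20, 1978 9:02 PM
205 4 164
210 4 164
</pre>
</html>"""

print(extract_header_lines(sample_page))
```

The parser handles the tags, and plain string checks handle the number column, so no regular expression is involved.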
Other people have offered up regex solutions, which are good but may behave unexpectedly at times.
If the pages are exactly as shown in your example, that is:
- No other HTML tags are present, only the <html> and <pre> tags
- The number of lines is always consistent
- The spacing between lines is always consistent
Then a simple approach like this will do:
my_text = """<html>
<pre>
A Short Study of Notation Efficiency
CACM August, 1960
Smith Jr., H. J.
CA600802 JB March 20, 1978 9:02 PM
205 4 164
210 4 164
214 4 164
642 4 164
1 5 164
</pre>
</html>"""
lines = my_text.split("\n")
# Indices match the sample exactly as shown: lines[0] is <html>, lines[1] is <pre>
title = lines[2]
journal = lines[3]
author = lines[4]
date = lines[5]
If you can't guarantee the spacing between lines, but you can guarantee that you only want the first four non-blank lines inside the <html><pre>:
import pprint
max_extracted_lines = 4
extracted_lines = []
for line in lines:
    line = line.strip()
    # Skip the wrapper tags
    if line == "<html>" or line == "<pre>":
        continue
    # Keep only non-blank lines
    if line:
        extracted_lines.append(line)
    # Stop once the four header lines are collected
    if len(extracted_lines) >= max_extracted_lines:
        break
pprint.pprint(extracted_lines)
Giving output:
['A Short Study of Notation Efficiency',
'CACM August, 1960',
'Smith Jr., H. J.',
'CA600802 JB March 20, 1978 9:02 PM']
Don't use regex where simple string operations will do.
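Since the question mentions several pages, the loop above could be wrapped in a small function and applied to each page. This is only a sketch under the same assumption as before (the first four non-blank lines are the header). The function name `first_nonblank_lines` and the two sample pages are hypothetical.

```python
def first_nonblank_lines(page_text, max_lines=4, skip=("<html>", "<pre>")):
    """Return up to max_lines non-blank lines, ignoring the wrapper tags."""
    collected = []
    for raw_line in page_text.splitlines():
        line = raw_line.strip()
        if not line or line in skip:
            continue
        collected.append(line)
        if len(collected) >= max_lines:
            break
    return collected


# Two hypothetical pages with different blank-line spacing.
pages = [
    "<html>\n<pre>\nTitle One\nCACM August, 1960\nAuthor A\nCA600802 JB\n205 4 164\n</pre>\n</html>",
    "<html>\n<pre>\n\nTitle Two\n\nCACM May, 1961\n\nAuthor B\n\nCA610501 JB\n\n1 5 164\n</pre>\n</html>",
]

headers = [first_nonblank_lines(page) for page in pages]
for header in headers:
    print(header)
```

Both pages yield four header lines even though the second one has blank lines between entries.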