
Let's say the string is <title>xyz</title> and I want to extract the xyz from it. I used:

Pattern titlePattern = Pattern.compile("&lttitle&gt\\s*(.+?)\\s*&lt/title&gt");
Matcher titleMatcher = titlePattern.matcher(line);
String title = titleMatcher.group(1);

But I am getting an error at titlePattern.matcher(line);

asked Dec 31, 2009 at 16:32
  • You need to tell people what the problem is if you expect us to debug it. Commented Dec 31, 2009 at 16:36
  • Somebody warn this guy about re + html! Commented Dec 31, 2009 at 16:37

3 Answers


You say the error occurs earlier, at titlePattern.matcher(line). What is the actual error? That line runs without an error for me. Once that is solved, you will still need to call find() on the matcher to actually search for the pattern before calling group():

if(titleMatcher.find()){
 String title = titleMatcher.group(1);
}
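To make the difference concrete, here is a minimal sketch of what happens with and without the find() call. The class and method names are made up for illustration, and it uses the unescaped pattern. Calling group(1) on a matcher that has not attempted a match throws an IllegalStateException.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: class and method names are not from the question.
class MatcherFindDemo {
    private static final Pattern TITLE =
            Pattern.compile("<title>\\s*(.+?)\\s*</title>");

    // Calls group(1) without find(): no match has been attempted yet,
    // so the matcher throws IllegalStateException.
    static boolean groupWithoutFindThrows(String line) {
        Matcher m = TITLE.matcher(line);
        try {
            m.group(1);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    // Calls find() first, then reads the captured group.
    // Returns null when the pattern does not occur in the line.
    static String titleAfterFind(String line) {
        Matcher m = TITLE.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "<title>xyz</title>";
        System.out.println(groupWithoutFindThrows(line));
        System.out.println(titleAfterFind(line));
    }
}
```

Returning null for "no match" is just one choice for the sketch; an Optional<String> would work equally well.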

Note that if you are really matching against a string containing unescaped markup like

<title>xyz</title>

then your regular expression has to use the literal angle brackets, not the escaped entities:

"<title>\\s*(.+?)\\s*</title>"
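As a rough check, the two patterns can be run side by side against the same input. This is only a sketch with made-up class and method names. The pattern from the question only matches text that literally contains "&lttitle&gt", which is why find() returns false on a string with real angle brackets.

```java
import java.util.regex.Pattern;

// Illustrative only: class, field, and method names are assumptions.
class EntityPatternDemo {
    // Pattern from the question: searches for the literal text "&lttitle&gt".
    static final Pattern ESCAPED =
            Pattern.compile("&lttitle&gt\\s*(.+?)\\s*&lt/title&gt");

    // Pattern from the answer: searches for real angle brackets.
    static final Pattern LITERAL =
            Pattern.compile("<title>\\s*(.+?)\\s*</title>");

    // True when the pattern occurs anywhere in the input.
    static boolean found(Pattern pattern, String input) {
        return pattern.matcher(input).find();
    }

    public static void main(String[] args) {
        String line = "<title>xyz</title>";
        System.out.println("escaped pattern matches: " + found(ESCAPED, line));
        System.out.println("literal pattern matches: " + found(LITERAL, line));
    }
}
```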

Also, be careful about how far you take this, as you can't really parse HTML or XML with regular expressions. If you are working with XML, it's much easier to use an XML parser, e.g. JDOM.

answered Dec 31, 2009 at 16:35

4 Comments

Yeah, I can't seem to find it. Is this line wrong? Pattern titlePattern = Pattern.compile("&lttitle&gt\\s*(.+?)\\s*&lt/title&gt");
Are you getting an exception or are you just not getting the correct result?
It seems titleMatcher.find() always evaluates to false even though the string is <title>xyz</title>. So my only concern is that this part is wrong: &lttitle&gt\\s*(.+?)\\s*&lt/title&gt
Yes, see my addition to the answer. I tried the regular expression I gave at the end against the string shown above it, and it works for me.

Not technically an answer but you shouldn't be using regular expressions to parse HTML. You can try and you can get away with it for simple tasks but HTML can get ugly. There are a number of Java libraries that can parse HTML/XML just fine. If you're going to be working a lot with HTML/XML it would be worth your time to learn them.

answered Dec 31, 2009 at 16:44



As others have suggested, it's probably not a good idea to parse HTML/XML with regex. You can parse XML documents with the standard Java API, but I don't recommend it. As Fabian Steeg already answered, it's probably better to use JDOM or a similar open source library for parsing XML.

With javax.xml.parsers you can do the following:

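import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
String xml = "<title>abc</title>";
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
Document doc = docBuilder.parse(new InputSource(new StringReader(xml)));
NodeList nodeList = doc.getElementsByTagName("title");
String title = nodeList.item(0).getTextContent();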

This parses your XML string into a Document object which you can use for further lookups. The API is kinda horrible though.

Another way is to use XPath for the lookup:

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
XPathFactory xpathFactory = XPathFactory.newInstance();
XPath xPath = xpathFactory.newXPath();
String titleByXpath = xPath.evaluate("/title/text()", new InputSource(new StringReader(xml)));
// or use the Document for lookup
String titleFromDomByXpath = xPath.evaluate("/title/text()", doc);
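
The two lookups behave differently when the document has no title element, which is worth knowing before relying on either. The sketch below wraps both in a hypothetical helper class. The class and method names are made up, and the checked exceptions are rethrown unchecked only to keep the example short. With the DOM lookup, item(0) returns null when nothing matches, while the XPath string evaluation returns an empty string.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative only: class and method names are assumptions.
class XmlTitleLookup {
    // Parses an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Rethrown unchecked to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    // DOM lookup: item(0) is null when there is no <title> element,
    // so check before calling getTextContent().
    static String titleByDom(String xml) {
        Node first = parse(xml).getElementsByTagName("title").item(0);
        return first == null ? null : first.getTextContent();
    }

    // XPath lookup: evaluating to a string yields "" when nothing is selected.
    static String titleByXPath(String xml) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate("/title/text()", parse(xml));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titleByDom("<title>abc</title>"));
        System.out.println(titleByXPath("<title>abc</title>"));
    }
}
```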
answered Dec 31, 2009 at 17:28

