Let's say the string is <title>xyz</title>
I want to extract xyz from the string.
I used:
Pattern titlePattern = Pattern.compile("<title>\\s*(.+?)\\s*</title>");
Matcher titleMatcher = titlePattern.matcher(line);
String title = titleMatcher.group(1);
but I am getting an error at titlePattern.matcher(line);
You need to tell people what the problem is if you expect us to debug it. – Hank Gay, Dec 31, 2009 at 16:36
Somebody warn this guy about re + html! – Hamish Grubijan, Dec 31, 2009 at 16:37
3 Answers
You say the error occurs earlier, at titlePattern.matcher(line). What is the actual error? That line runs without an error for me. Once that is solved, you still need to call find() on the matcher to actually search for the pattern before you call group(1):
if(titleMatcher.find()){
String title = titleMatcher.group(1);
}
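For completeness, here is a minimal, self-contained sketch of the same idea. The class name TitleRegexDemo and the helper extractTitle are made up for illustration and are not from the question. It relies on the fact that Matcher.group(1) throws an IllegalStateException unless find() (or matches()) has succeeded first.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and the helper method are assumptions for illustration.
class TitleRegexDemo {
    // Same pattern as in the question, compiled once and reused.
    private static final Pattern TITLE_PATTERN =
            Pattern.compile("<title>\\s*(.+?)\\s*</title>");

    // Returns the captured title text, or an empty Optional when nothing matches.
    static Optional<String> extractTitle(String line) {
        Matcher matcher = TITLE_PATTERN.matcher(line);
        // group(1) is only valid after a successful find().
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extractTitle("<title>xyz</title>")); // prints Optional[xyz]
        System.out.println(extractTitle("no title here"));      // prints Optional.empty
    }
}
```

Returning an Optional is just one way to make the "no match" case explicit; returning null or throwing would work as well.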
Note that if you are really matching against a string with literal, unescaped angle brackets like
<title>xyz</title>
then your regular expression has to use those literal characters, not the escaped entities (&lt; and &gt;):
"<title>\\s*(.+?)\\s*</title>"
Also, be careful about how far you push this, as you can't really parse HTML or XML with regular expressions. If you are working with XML, it's much easier to use an XML parser, e.g. JDOM.
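To illustrate how quickly the regex approach runs into trouble, here is a small sketch (the class RegexLimitsDemo and the method findsTitle are hypothetical names) that feeds the pattern from the question a few slightly different but perfectly ordinary title elements.

```java
import java.util.regex.Pattern;

// Hypothetical demo class showing inputs the simple pattern does not handle.
class RegexLimitsDemo {
    private static final Pattern TITLE_PATTERN =
            Pattern.compile("<title>\\s*(.+?)\\s*</title>");

    // True when the pattern finds a title element somewhere in the markup.
    static boolean findsTitle(String markup) {
        return TITLE_PATTERN.matcher(markup).find();
    }

    public static void main(String[] args) {
        System.out.println(findsTitle("<title>xyz</title>"));             // true
        System.out.println(findsTitle("<TITLE>xyz</TITLE>"));             // false: the pattern is case-sensitive
        System.out.println(findsTitle("<title lang=\"en\">xyz</title>")); // false: attributes are not anticipated
        System.out.println(findsTitle("<title>x\ny</title>"));            // false: '.' does not match a newline by default
    }
}
```

Each case can be patched with flags or a more elaborate pattern, but the list of special cases keeps growing, which is why a parser is usually the easier route.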
Not technically an answer, but you shouldn't be using regular expressions to parse HTML. You can get away with it for simple tasks, but HTML can get ugly. There are a number of Java libraries that can parse HTML/XML just fine. If you're going to be working a lot with HTML/XML, it would be worth your time to learn them.
As others have suggested, it's probably not a good idea to parse HTML/XML with regex. You can parse XML documents with the standard Java API, but I don't recommend it. As Fabian Steeg already answered, it's probably better to use JDOM or a similar open source library for parsing XML.
With javax.xml.parsers you can do the following:
import java.io.StringReader;
import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.InputSource;

String xml = "<title>abc</title>";
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
// Wrap the string in an InputSource so the parser can read it
Document doc = docBuilder.parse(new InputSource(new StringReader(xml)));
NodeList nodeList = doc.getElementsByTagName("title");
String title = nodeList.item(0).getTextContent();
This parses your XML string into a Document object which you can use for further lookups. The API is kinda horrible though.
Another way is to use XPath for the lookup:
import javax.xml.xpath.*;

XPathFactory xpathFactory = XPathFactory.newInstance();
XPath xPath = xpathFactory.newXPath();
String titleByXpath = xPath.evaluate("/title/text()", new InputSource(new StringReader(xml)));
// or use the Document for lookup
String titleFromDomByXpath = xPath.evaluate("/title/text()", doc);
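The snippets above leave out the checked exceptions these APIs throw. Below is one possible way to wrap both lookups into a runnable class. The class TitleXmlDemo, its method names, and the choice to rethrow failures as IllegalArgumentException are my own assumptions for the sketch, not something the standard API prescribes.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class wrapping the DOM and XPath lookups shown above.
class TitleXmlDemo {
    // Parses an XML string into a DOM Document, wrapping checked exceptions.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // DOM lookup: text of the first <title> element, or null if there is none.
    static String titleViaDom(String xml) {
        Node first = parse(xml).getElementsByTagName("title").item(0);
        return first == null ? null : first.getTextContent();
    }

    // XPath lookup: text of a root-level <title> element ("" if nothing is selected).
    static String titleViaXPath(String xml) {
        try {
            XPath xPath = XPathFactory.newInstance().newXPath();
            return xPath.evaluate("/title/text()", parse(xml));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<title>abc</title>";
        System.out.println(titleViaDom(xml));   // prints abc
        System.out.println(titleViaXPath(xml)); // prints abc
    }
}
```

Note that the XPath expression /title/text() only works when title is the root element, as in this example. For a title nested deeper in a document, something like //title/text() would be needed instead.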