3

I'm creating a little web app to help me manage and analyze the content of my websites, and cURL is my favorite new toy. I've figured out how to extract info about all sorts of elements, how to find all elements with a certain class, etc., but I am stuck on two problems (see below). I hope there is some nifty XPath answer, but if I have to resort to regular expressions, I guess that's OK. I'm not great with regex, though, so if you think that's the way to go, I'd appreciate examples.

Pretty standard starting point:

$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    // request failed or returned an empty body
    $info .= "<br />cURL error number:" . curl_errno($ch);
    $info .= "<br />cURL error:" . curl_error($ch);
    return $info;
}
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ hides warnings about malformed markup
$xpath = new DOMXPath($dom);

and extraction of info, for example:

// iframes
$iframes = $xpath->evaluate("/html/body//iframe");
$info .= '<h3>iframes ('.$iframes->length.'):</h3>';
for ($i = 0; $i < $iframes->length; $i++) {
    // get iframe attributes (iframes have no alt attribute, so it is not read)
    $iframe = $iframes->item($i);
    $framesrc = $iframe->getAttribute("src");
    $framewidth = $iframe->getAttribute("width");
    $frameheight = $iframe->getAttribute("height");
    $frameclass = $iframe->getAttribute("class");
    $info .= $framesrc.'&nbsp;('.$framewidth.'x'.$frameheight.'; class="'.$frameclass.'")'.'<br />';
}

Questions/Problems:

  1. How to extract HTML comments?

    I can't figure out how to identify the comments – are they considered nodes, or something else entirely?

  2. How to get the entire content of a div, including child nodes? So if the div contains an image and a couple of links, it would find those and hand it all back to me as a block of HTML.

Marcel Korpel
asked May 18, 2011 at 21:54

4 Answers

15

Comment nodes should be easy to find in XPath with the comment() test, analogous to the text() test:

$comments = $xpath->query('//comment()'); // or another path, as you prefer

They are standard nodes: see the PHP manual entry for the DOMComment class.
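
As a rough illustration of the comment() test, here is a minimal sketch that collects the text of every comment in a document. The sample markup and the variable names are made up for the example; `libxml_use_internal_errors()` is used here in place of the `@` operator to keep parse warnings quiet.

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<html><body><!-- first --><div><!-- second --><p>Hi</p></div></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$comments = array();
foreach ($xpath->query('//comment()') as $node) {
    // Each $node is a DOMComment; nodeValue is the text between <!-- and -->
    $comments[] = trim($node->nodeValue);
}

print_r($comments);
```

To limit the search to part of the page, a narrower path such as `//div//comment()` should work the same way.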


Your other question is a bit trickier. The simplest way is to use saveXML() with its optional $node argument:

// $el should be the element you want to get the HTML for
$html = $dom->saveXML($el);
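
A minimal sketch of the same idea, assuming PHP 5.3.6 or later, where saveHTML() also accepts a node argument and serializes as HTML rather than XML (saveXML() would write void elements in XML style, e.g. `<img .../>`). The markup, the `box` id, and the variable names are invented for the example.

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<html><body><div id="box"><img src="a.png" alt="A"><a href="/one">One</a></div></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[@id="box"]')->item(0);

// "Outer HTML": the div itself plus everything inside it
$outer = $dom->saveHTML($div);

// "Inner HTML": serialize each child node and concatenate the results
$inner = '';
foreach ($div->childNodes as $child) {
    $inner .= $dom->saveHTML($child);
}

echo $outer, "\n", $inner, "\n";
```

Serializing the children one by one is one way to get the block without the wrapping div; which form you want depends on how the result is used.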
answered May 18, 2011 at 22:04

2 Comments

Hooray! Thank you so much. Problems solved. By the way, if you don't mind, in this context what's the difference between $xpath->query and $xpath->evaluate?
@alison Glad to help. Here there is no difference.
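
To make the comparison concrete, here is a small sketch (sample markup invented for the example). For an expression that selects nodes, both methods return a DOMNodeList; evaluate() can additionally return a typed scalar for expressions such as count().

```php
<?php
$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
// Hypothetical sample markup, used only for illustration.
$dom->loadHTML('<html><body><!-- a --><p>x</p><!-- b --></body></html>');
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Node-set expression: both calls give a DOMNodeList with the same nodes
$viaQuery    = $xpath->query('//comment()');
$viaEvaluate = $xpath->evaluate('//comment()');

// Scalar expression: evaluate() returns a number rather than a node list
$count = $xpath->evaluate('count(//comment())');

var_dump($viaQuery->length, $viaEvaluate->length, $count);
```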
1

For HTML comments, a quick method is:

function getComments($html) {
    $comments = array();
    // PREG_SET_ORDER groups the results per match, so $m[1] is the text inside each comment
    if (preg_match_all('#<!--(.*?)-->#is', $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $comments[] = $m[1];
        }
        return $comments;
    }
    // No comments matched
    return null;
}
Marcel Korpel
answered May 18, 2011 at 22:06

0

This regex should help you:

\s*<!--[\s\S]+?-->

It can be tried out in a regex tester.
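
A short sketch of how this pattern might be applied with preg_match_all(); the sample string and variable names are invented for the example. Because the pattern starts with `\s*`, each match includes any whitespace before the comment, so the matches are trimmed here. Note that `[\s\S]+?` needs at least one character, so an empty comment (`<!---->`) would not be matched on its own.

```php
<?php
// Hypothetical sample input, used only for illustration.
$html = "<p>a</p>\n  <!-- one -->\n<p>b</p><!--\ntwo\n-->";

// [\s\S] matches any character including newlines, so no /s flag is needed
$pattern = '/\s*<!--[\s\S]+?-->/';

preg_match_all($pattern, $html, $matches);

// $matches[0] holds the full matches; strip the leading whitespace captured by \s*
$found = array_map('trim', $matches[0]);

print_r($found);
```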

answered Oct 26, 2019 at 17:50

-3

For comments, you're looking for a recursive regex. For instance, to get rid of HTML comments:

preg_replace('/<!--(?(?=<!--)(?R)|.)*?-->/s', '', $yourHTML);

To find them:

preg_match_all('/(<!--(?(?=<!--)(?R)|.)*?-->)/s',$yourHTML,$comments);
TheHippo
answered May 28, 2013 at 20:27

1 Comment

This doesn't work. I tried it.
