replaced http://stackoverflow.com/ with https://stackoverflow.com/

edited May 23, 2017 at 12:40

Ouch. The first thing I noticed is that you're processing markup, and have chosen to do so using regular expressions. That's a bad idea from the off.
It's especially bad considering you have a tool at your disposal that is specifically built to parse markup: the DOMDocument object.

added 764 characters in body

edited Feb 25, 2014 at 14:56

Elias Van Ootegem

As a guide-to-the-DOMDocument-docs:

Nearly every class you'll work with extends the DOMNode class. A notable exception is the DOMNodeList class, which is a Traversable containing only DOMNode instances, accessible through the item method. Anyway, check the DOMNode methods and properties. Learn that class by heart, as almost every object you work with inherits its properties and methods.
The other class to check is the DOMElement class. Instances of this class represent an element node (a tag, if you will). It has some methods that I've used here (like removeAttribute).
Finally, everything, of course, starts with the DOMDocument class. I had already linked to the docs, but an extra link never hurt anyone.
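
To make those class relationships concrete, here is a minimal sketch. The markup in $html is a made-up sample (an assumption, not taken from the question); the checks only confirm which classes the returned objects belong to.

```php
<?php
// Hypothetical sample markup, used purely for illustration.
$html = '<html><body id="page" class="wide"><div id="container"><p>Hello</p></div></body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html);

// getElementsByTagName returns a DOMNodeList, not an array
$list = $dom->getElementsByTagName('body');
var_dump($list instanceof DOMNodeList); // the list itself
var_dump($list instanceof Traversable); // so it can be used in foreach

// item() gives access to the individual nodes in the list
$body = $list->item(0);
var_dump($body instanceof DOMElement);  // a tag is a DOMElement...
var_dump($body instanceof DOMNode);     // ...which inherits from DOMNode
var_dump($dom instanceof DOMNode);      // the document is a DOMNode, too

// nodeName comes from DOMNode; removeAttribute is defined on DOMElement
$names = array();
foreach ($body->attributes as $attr) {
    $names[] = $attr->nodeName;
}
$body->removeAttribute('class');
```

Because nodeName and attributes are inherited from DOMNode, the same loop would work on any element, not just the body tag.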

added 1457 characters in body

edited Feb 25, 2014 at 14:43

Elias Van Ootegem

$DOM = new DOMDocument;//create DOM
$DOM->loadHTMLFile($theUrl);//parse HTML
$container = $DOM->getElementById('container');//get the id=container element
if ($container)
{//make sure it was found
    $container->removeAttribute('id');//remove the id attribute
}
//variable names are case-sensitive, so this has to be $DOM, not $dom
$body = $DOM->getElementsByTagName('body')->item(0);//get body
//remove attributes
while ($body->hasAttributes())
{//while tag has attributes
    //remove the first remaining attribute
    $body->removeAttribute($body->attributes->item(0)->nodeName);
}
//Get the body as string, use saveXML on the node,
//as saveHTML() without arguments returns the whole document (DOCTYPE, html, head)
//substr strips the leading <body> (6 chars) and the trailing </body> (7 chars)
$body = substr($DOM->saveXML($body), 6, -7);

Perhaps this is a bit easier to understand. Anyway: read the DOMDocument docs, there's a lot you can do there. Just click on the methods I used, and click the objects they return, too, so you know what instances you're working with.
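
As an alternative to trimming the tags off with substr, here is a hedged sketch that builds the same string from the body's child nodes. It assumes PHP 5.3.6 or later, where saveHTML accepts a node argument and then serializes only that node; the $html string is a made-up sample.

```php
<?php
// Hypothetical sample markup, used purely for illustration.
$html = '<html><body class="x"><div id="container"><p>Hi</p></div></body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$body = $dom->getElementsByTagName('body')->item(0);

// Serialize each child of <body> on its own and concatenate the results,
// so the <body> tag itself never ends up in the output.
$inner = '';
foreach ($body->childNodes as $child) {
    $inner .= $dom->saveHTML($child);
}
echo $inner, "\n";
```

This avoids hard-coding the lengths of the opening and closing body tags, at the cost of a loop. Which of the two reads better is a matter of taste.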

Now wrap all this in a function, and we can add some checks to avoid exceptions or errors from showing up:

function getBody($file, DOMDocument $dom = null)
{//optionally pass existing DOMDocument instance
    //to avoid creating a new one each time we call this function
    if (!file_exists($file))
    {//check if the file exists, if not:
        throw new InvalidArgumentException($file. ' does not exist!');
    }
    $dom = $dom instanceof DOMDocument ? $dom : new DOMDocument;
    $dom->loadHTMLFile($file);
    $container = $dom->getElementById('container');
    if ($container)
    {//only touch the element if it was found
        $container->removeAttribute('id');
    }
    $body = $dom->getElementsByTagName('body')->item(0);
    if (!$body)
    {//no body tag found (e.g. an empty file): nothing to return
        return '';
    }
    while ($body->hasAttributes())
    {//strip the attributes one by one
        $body->removeAttribute(
            $body->attributes->item(0)->nodeName
        );
    }
    //strip the <body> and </body> tags from the serialized node
    $body = substr($dom->saveXML($body), 6, -7);
    //optionally remove leading and trailing whitespace:
    return trim($body);
}

Call this function either like this:

$body1 = getBody('path/to/file1.html');

or, since you're going to process more than 1 file, and we can avoid creating a new DOMDocument instance for every function call:

$dom = new DOMDocument;
$files = array('file1.html', 'file2.html');
$bodyStrings = array();
foreach ($files as $file)
 $bodyStrings[] = getBody($file, $dom);

That's it.
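
If some of the files might be missing, the loop can catch the exception and carry on. The sketch below is one way to do that, not the only one. It repeats a compact copy of getBody so it runs on its own (with a nullable type hint, which assumes PHP 7.1 or later), uses a temporary file as hypothetical input, and calls libxml_use_internal_errors so that parse warnings for sloppy markup are collected rather than printed.

```php
<?php
// Compact copy of getBody() so this sketch is self-contained.
function getBody($file, ?DOMDocument $dom = null)
{
    if (!file_exists($file)) {
        throw new InvalidArgumentException($file . ' does not exist!');
    }
    $dom = $dom instanceof DOMDocument ? $dom : new DOMDocument;
    $dom->loadHTMLFile($file);
    $container = $dom->getElementById('container');
    if ($container) {
        $container->removeAttribute('id');
    }
    $body = $dom->getElementsByTagName('body')->item(0);
    while ($body->hasAttributes()) {
        $body->removeAttribute($body->attributes->item(0)->nodeName);
    }
    return trim(substr($dom->saveXML($body), 6, -7));
}

// Collect libxml parse warnings instead of printing them.
libxml_use_internal_errors(true);

// Hypothetical input: one temporary file that exists, one path that does not.
$existing = tempnam(sys_get_temp_dir(), 'body');
file_put_contents(
    $existing,
    '<html><body class="home"><div id="container"><p>Hi</p></div></body></html>'
);
$files = array($existing, $existing . '.missing');

$dom = new DOMDocument; // reused for every file
$bodyStrings = array();
$failed = array();
foreach ($files as $file) {
    try {
        $bodyStrings[$file] = getBody($file, $dom);
    } catch (InvalidArgumentException $e) {
        // remember the failure and move on to the next file
        $failed[$file] = $e->getMessage();
    }
    libxml_clear_errors(); // discard warnings collected for this file
}
unlink($existing); // clean up the temporary file
```

Keying both arrays by file name makes it easy to see afterwards which files were processed and which were skipped.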

added 1457 characters in body

edited Feb 25, 2014 at 14:28

Elias Van Ootegem

answered Feb 25, 2014 at 13:56

Elias Van Ootegem
