Ouch. The first thing I noticed is that you're processing markup, and have chosen to do so using regular expressions. That's a bad idea from the off. It's especially bad considering you have a tool at your disposal that is specifically built to parse markup: the DOMDocument object.
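To make the regex problem concrete, here's a minimal sketch. The two sample strings and the pattern are made up for illustration (they are not taken from your code), and bodyAttributeNames is a hypothetical helper. The point is only that markup which a browser treats as equivalent can look very different to a pattern, while the parser normalises it:

```php
<?php
// Two made-up documents that mean the same thing to a browser.
$a = '<html><body id="x" class="y"><p>Hi</p></body></html>';
$b = "<html><BODY class='y'\n id=x><p>Hi</p></BODY></html>";

// A hypothetical, naive pattern: it assumes lowercase tag names,
// double quotes and a fixed attribute order.
$pattern = '/<body id="[^"]*" class="[^"]*">/';
$matchesA = preg_match($pattern, $a); // matches
$matchesB = preg_match($pattern, $b); // does not match

// The DOM approach: let the parser deal with case, quoting and order.
function bodyAttributeNames($html)
{
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    $body = $dom->getElementsByTagName('body')->item(0);
    $names = array();
    foreach ($body->attributes as $attr) {
        $names[] = $attr->nodeName;
    }
    sort($names);
    return $names;
}

$namesA = bodyAttributeNames($a);
$namesB = bodyAttributeNames($b);
```

Both calls to bodyAttributeNames should give you the same list of attribute names, even though the pattern only recognises the first string.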
As a guide to the DOMDocument docs:
Nearly every class extends the DOMNode class. The notable exception is the DOMNodeList class, which is a Traversable containing only DOMNode instances, accessible through the item method. Anyway, check the DOMNode methods and properties. Learn that class by heart, as almost every object you work with inherits its properties and methods.
The other class to check is the DOMElement class. Instances of this class represent an element node (a tag, if you will). It has some methods that I've used here (like removeAttribute).
Finally, everything, of course, starts with the DOMDocument class. I had already linked to the docs, but an extra link never hurt anyone.
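As a rough illustration of how these classes fit together, here's a small sketch. The markup is a made-up sample, not your file: getElementsByTagName hands back a DOMNodeList, item() gives you the individual nodes, and each of those is a DOMElement that inherits from DOMNode:

```php
<?php
$dom = new DOMDocument;
// Made-up sample markup, purely for illustration.
$dom->loadHTML('<html><body><p class="a">one</p><p>two</p></body></html>');

// DOMNodeList: not a DOMNode itself, just a list of them.
$paragraphs = $dom->getElementsByTagName('p');

$texts = array();
for ($i = 0; $i < $paragraphs->length; $i++) {
    $node = $paragraphs->item($i);
    // nodeName and textContent are inherited from DOMNode.
    $texts[] = $node->nodeName . ': ' . $node->textContent;
}

$first = $paragraphs->item(0);
$isNode = $first instanceof DOMNode;       // inherited base class
$isElement = $first instanceof DOMElement; // the concrete class for tags

// removeAttribute is one of the methods DOMElement adds.
$first->removeAttribute('class');
$stillHasAttributes = $first->hasAttributes();
```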
$DOM = new DOMDocument;//create DOM
$DOM->loadHTMLFile($theUrl);//parse HTML
$container = $DOM->getElementById('container');//get the id=container element
if ($container)
{//make sure it was found
    $container->removeAttribute('id');//remove the id attribute
}
$body = $DOM->getElementsByTagName('body')->item(0);//get body
//remove attributes
while ($body->hasAttributes())
{//while tag has attributes
    //remove the first remaining attribute
    $body->removeAttribute($body->attributes->item(0)->nodeName);
}
//Get the body as string, use saveXML with the node,
//as saveHTML() without a node argument outputs the DOCTYPE, html, head and title tags again.
//substr then strips the opening <body> (6 chars) and the closing </body> (7 chars)
$body = substr($DOM->saveXML($body), 6, -7);
Perhaps this is a bit easier to understand. Anyway: read the DOMDocument docs, there's a lot you can do there. Look up the methods I used, and the objects they return, too, so you know what instances you're working with.
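If the substr(..., 6, -7) offsets feel a bit magic, here's an alternative sketch. It assumes PHP 5.3.6 or later, where saveHTML accepts a node argument, and innerHtml is a hypothetical helper name of my own. Instead of serialising the body tag and cutting it off again, it serialises each child of the body. A side effect worth knowing: saveXML writes empty tags XML-style (like `<br/>`), whereas saveHTML sticks to HTML syntax.

```php
<?php
// Hypothetical helper: concatenate the serialised children of a node.
function innerHtml(DOMNode $node)
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML with a node argument serialises only that node.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$dom = new DOMDocument;
// Made-up sample markup, purely for illustration.
$dom->loadHTML('<html><body class="x"><p>Hi</p><br></body></html>');
$body = $dom->getElementsByTagName('body')->item(0);

$inner = innerHtml($body);
```

Because the body tag itself is never serialised, its attributes don't have to be stripped first if all you want is the contents.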
Now wrap all this in a function, and we can add a check so that a missing file fails early with a clear exception, instead of errors showing up later on:
function getBody($file, DOMDocument $dom = null)
{//optionally pass existing DOMDocument instance
    //to avoid creating a new one each time we call this function
    if (!file_exists($file))
    {//check if the file exists, if not:
        throw new InvalidArgumentException($file. ' does not exist!');
    }
    $dom = $dom instanceof DOMDocument ? $dom : new DOMDocument;
    $dom->loadHTMLFile($file);
    $container = $dom->getElementById('container');
    if ($container)
        $container->removeAttribute('id');
    $body = $dom->getElementsByTagName('body')->item(0);
    while($body->hasAttributes())
    {
        $body->removeAttribute(
            $body->attributes->item(0)->nodeName
        );
    }
    $body = substr($dom->saveXML($body), 6, -7);
    //optionally remove leading and trailing whitespace:
    return trim($body);
}
Call this function either like this:
$body1 = getBody('path/to/file1.html');
or, since you're going to process more than one file, pass in a single DOMDocument instance, so we avoid creating a new one for every function call:
$dom = new DOMDocument;
$files = array('file1.html', 'file2.html');
$bodyStrings = array();
foreach ($files as $file)
    $bodyStrings[] = getBody($file, $dom);
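One more thing you may run into: real-world HTML is rarely valid, and the parser tends to report every problem as a PHP warning. A minimal sketch of keeping those quiet, assuming you'd rather collect the parse errors than display them (the markup is a made-up sample; what exactly gets reported depends on your libxml version):

```php
<?php
// Collect libxml's parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument;
// Made-up sample with an unclosed <p>; real pages are often worse.
$loaded = $dom->loadHTML('<html><body id="container"><p>Unclosed paragraph</body></html>');

// Each entry is a LibXMLError object with message and line properties.
$errors = libxml_get_errors();
$messages = array();
foreach ($errors as $error) {
    $messages[] = trim($error->message) . ' on line ' . $error->line;
}

// Empty the error buffer and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$body = $dom->getElementsByTagName('body')->item(0);
```

You could put the same libxml_use_internal_errors calls around the loadHTMLFile line inside getBody, and decide there whether a non-empty list of messages should be logged or ignored.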
That's it.