PHP DOMDocument saveHTML breaks format

Question 1

Why would this code:

$doc = new DOMDocument();
$doc->loadHTML($this->content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$imgNodes = $doc->getElementsByTagName('img');
if ($imgNodes->length > 0) {
    $inlineImage = new Image();
    $inlineImage->setPublicDir($publicDirPath);
    foreach ($imgNodes as $imgNode) {
        $inlineImage->setUri($imgNode->getAttribute('src'));
        $inlineImage->setName(basename($inlineImage->getUri()));
        if ($inlineImage->getUri() != $dstPath.$inlineImage->getName()) {
            $inlineImage->move($dstPath);
            $imgNode->setAttribute('src', $dstPath.'/'.$inlineImage->getName());
        }
    }
    $this->content = $doc->saveHTML();
}

executed on this HTML:

<p><img alt="fluid cat" src="/images/tmp/fluid-cat.jpg"></p><p><img alt="pandas" src="/images/tmp/pandas.jpg"></p>

result in this HTML:

<p><img alt="fluid cat" src="/images/full/2016-09/fluid-cat.jpg"><p><img alt="pandas" src="/images/full/2016-09/pandas.jpg"></p></p>

Why does it place both img tags inside the first p block?

Comment 1

Because your HTML sample doesn't have a root element. libxml assumes that the first p is the root element and applies an automatic fix: it removes the "orphan" closing p tag and puts a closing tag in the "right place", i.e. at the end. To fix the problem, either add a fake root element (<div>...</div>, for example) or remove LIBXML_HTML_NOIMPLIED, then extract the child nodes one by one and concatenate them to build the result string.
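The second option in this comment (dropping LIBXML_HTML_NOIMPLIED) could look roughly like the sketch below. It assumes libxml adds the implied <html> and <body> wrappers when the flag is absent, and then serializes only the children of <body>. The function name saveBodyFragment is made up for illustration; it is not part of DOMDocument.

```php
<?php
// Hypothetical helper (not a DOMDocument method): parse an HTML fragment
// without LIBXML_HTML_NOIMPLIED and return only what ends up inside <body>.
function saveBodyFragment(string $html): string
{
    $doc = new DOMDocument();
    // Without LIBXML_HTML_NOIMPLIED, libxml wraps the fragment in <html><body>,
    // so the first <p> is no longer mistaken for the root element.
    $doc->loadHTML($html, LIBXML_HTML_NODEFDTD);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return ''; // nothing was parsed into a body
    }

    // Serialize each child of <body> individually and concatenate.
    $result = '';
    foreach ($body->childNodes as $child) {
        $result .= $doc->saveHTML($child);
    }
    return $result;
}

$html = '<p><img alt="fluid cat" src="/images/tmp/fluid-cat.jpg"></p>'
      . '<p><img alt="pandas" src="/images/tmp/pandas.jpg"></p>';
echo saveBodyFragment($html), "\n";
```

Both paragraphs should come back as siblings, without the <html> or <body> wrappers. Depending on the libxml version, the serializer may insert a newline between the two blocks.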

Comment 2

I'm pretty sure DOMDocument tries to correctly format things for HTML. Try adding a / at the end of your img tag to make it self-closing.

Comment 3

loadHTML() and saveHTML() are terribly broken and useless in practice. Consider using a third-party HTML parser like html5lib-php and a custom HTML-code generator.

Answer

Your HTML sample doesn't have a root element that wraps everything. When libxml parses the HTML to build the DOM tree, it assumes that the first tag it encounters is the root element. As a consequence, the first closing </p> tag is treated as an orphan closing tag (because there is content after it) and is automatically removed, and a </p> is added at the end to close the root element.

To avoid these automatic fixes when you are working with HTML fragments (not a whole HTML document), you need to add a fake root element. At the end, to produce the result string, you need to save each child node of this fake root element. Example:

$html = '<p><img alt="fluid cat" src="/images/tmp/fluid-cat.jpg"></p><p><img alt="pandas" src="/images/tmp/pandas.jpg"></p>';
$doc = new DOMDocument;
// The surrounding <div>...</div> is the fake root element.
$doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
$root = $doc->documentElement;
$result = '';
// Serialize each child of the fake root, leaving the <div> itself out.
foreach ($root->childNodes as $childNode) {
    $result .= $doc->saveHTML($childNode);
}
echo $result;
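Applied to the loop from the question, the fake-root approach might be wrapped up as in the following sketch. The helper name rewriteImageSources and its callback parameter are assumptions made for illustration. The Image class and the file move from the question are left out, so only the src rewriting is shown.

```php
<?php
// Hypothetical helper: rewrite every <img src> in an HTML fragment.
// $rewrite receives the old src value and returns the new one.
function rewriteImageSources(string $fragment, callable $rewrite): string
{
    $doc = new DOMDocument();
    // Fake <div> root, so libxml does not treat the first <p> as the root.
    $doc->loadHTML(
        '<div>' . $fragment . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );

    foreach ($doc->getElementsByTagName('img') as $img) {
        $img->setAttribute('src', $rewrite($img->getAttribute('src')));
    }

    // Serialize the children of the fake root; the <div> itself is dropped.
    $result = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $result .= $doc->saveHTML($child);
    }
    return $result;
}

$html = '<p><img alt="fluid cat" src="/images/tmp/fluid-cat.jpg"></p>'
      . '<p><img alt="pandas" src="/images/tmp/pandas.jpg"></p>';

// Example callback: point each image at the destination directory used in the question.
echo rewriteImageSources($html, function (string $src): string {
    return '/images/full/2016-09/' . basename($src);
}), "\n";
```

Passing the rewrite rule as a callback keeps the DOM handling separate from whatever decides the new path, so the move logic from the question could be plugged in there.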

score 15 · Accepted Answer · 2016-09-14 00:33:13Z

