Why would this code:

$doc = new DOMDocument();
$doc->loadHTML($this->content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$imgNodes = $doc->getElementsByTagName('img');
if ($imgNodes->length > 0) {
    $inlineImage = new Image();
    $inlineImage->setPublicDir($publicDirPath);
    foreach ($imgNodes as $imgNode) {
        $inlineImage->setUri($imgNode->getAttribute('src'));
        $inlineImage->setName(basename($inlineImage->getUri()));
        if ($inlineImage->getUri() != $dstPath.$inlineImage->getName()) {
            $inlineImage->move($dstPath);
            $imgNode->setAttribute('src', $dstPath.'/'.$inlineImage->getName());
        }
    }
    $this->content = $doc->saveHtml();
}

executed on this HTML:

<p><img alt="fluid cat" src="/images/tmp/fluid-cat.jpg"></p><p><img alt="pandas" src="/images/tmp/pandas.jpg"></p>

result in this HTML:

<p><img alt="fluid cat" src="/images/full/2016-09/fluid-cat.jpg"><p><img alt="pandas" src="/images/full/2016-09/pandas.jpg"></p></p>

Why does it place both img tags inside the first p block?

Casimir et Hippolyte
asked Sep 13, 2016 at 22:36
  • Because your HTML sample doesn't have a root element. Libxml assumes that the first p is the root element and performs an automatic fix: it removes the "orphan" closing p tag and puts a closing tag in the "right place", i.e. at the end. To fix the problem, add a fake root element (<div>...</div>, for example), or remove LIBXML_HTML_NOIMPLIED, and extract the child nodes one by one to build the result string by concatenation. Commented Sep 13, 2016 at 23:43
  • I'm pretty sure DOMDocument tries to correctly format things for HTML. Try adding a / at the end of your img tag to make it self-closing. Commented Sep 14, 2016 at 0:04
  • loadHTML() and saveHTML() are terribly broken and useless in practice. Consider using a third-party HTML parser like html5lib-php and a custom HTML-code generator. Commented Sep 14, 2016 at 0:56

1 Answer

Your HTML sample doesn't have a single root element that surrounds everything. When libxml parses the HTML to build the DOM tree (with LIBXML_HTML_NOIMPLIED set, so no implied html/body wrapper is added), it assumes that the first tag it encounters is the root element. As a consequence, the first </p> is seen as an orphan closing tag (because there is content after it) and is automatically removed, and a </p> is added at the end to close the root element.

To avoid these automatic fixes when you are working with HTML fragments (rather than a whole HTML document), add a fake root element. Then, to produce the result string, save each child node of this fake root element. Example:

$html = '<p><img alt="fluid cat" src="/images/tmp/fluid-cat.jpg"></p><p><img alt="pandas" src="/images/tmp/pandas.jpg"></p>';
$doc = new DOMDocument;
$doc->loadHTML( '<div>' . $html . '</div>', LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
#               ^-----------------^----- fake root element
$root = $doc->documentElement;
$result = '';
foreach ($root->childNodes as $childNode) {
    $result .= $doc->saveHTML($childNode);
}
echo $result;
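
Applied to the question's use case, the same idea might look like the sketch below. It is only an illustration: rewriteImageSources() is a hypothetical helper name that does not come from the original code, the Image class and its move() call are left out, and the destination path is taken from the question's expected output.

```php
<?php
// Hypothetical helper (not part of the original post): points every <img> src
// in an HTML fragment at $dstPath while keeping the file name.
function rewriteImageSources(string $fragment, string $dstPath): string
{
    $doc = new DOMDocument();
    // Wrap the fragment in a fake root so libxml does not treat the first <p> as the root.
    $doc->loadHTML(
        '<div>' . $fragment . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );

    foreach ($doc->getElementsByTagName('img') as $img) {
        $name = basename($img->getAttribute('src'));
        $img->setAttribute('src', rtrim($dstPath, '/') . '/' . $name);
    }

    // Serialize only the children of the fake root, never the root itself.
    $result = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $result .= $doc->saveHTML($child);
    }
    return $result;
}

$html = '<p><img alt="fluid cat" src="/images/tmp/fluid-cat.jpg"></p>'
      . '<p><img alt="pandas" src="/images/tmp/pandas.jpg"></p>';
$out = rewriteImageSources($html, '/images/full/2016-09');
echo $out;
```

With the fake root in place, each img should stay inside its own p block; the serializer may still insert line breaks between the two paragraphs.
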
answered Sep 14, 2016 at 0:33
