I struggled for the last couple days on this question and finally came up with a solution. But it's got so many loops I get dizzy just looking at the code - from recursive functions with loops to loops within loops.
I really expected to find a canned function in a popular XML class to extract all the unique nodes without data, but I am surprised how non-trivial I've found this endeavor.
While this works for my piddly example data, I have performance and scalability concerns about using this code in production. For example, this elegant answer uses XSLT to solve the problem, but it's limited by node depth. So is there any way to improve efficiency by cutting out a loop, using fewer objects, or even using a different class or technology entirely? Are there any glaring failure conditions that aren't represented in my example data?
<?php
$xml = '
<root>
<node/>
<node>
<sub>more</sub>
</node>
<node>
<sub>another</sub>
</node>
<node>value</node>
</root>
';
$doc = new DOMDocument();
$doc->loadXML($xml);
// clone without data
$empty_xml = new DOMDocument();
$empty_xml->appendChild($empty_xml->importNode($doc->documentElement));
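function clone_without_data(&$orig, &$clone, &$clonedoc){
    foreach ($orig->childNodes as $child){
        // only elements are copied; text and other node types are skipped
        if(get_class($child) === "DOMElement") {
            // shallow import: the element itself, without its children
            $new_node = $clone->appendChild($clonedoc->importNode($child));
            // recurse only for the element that was just imported
            if($child->hasChildNodes()) {
                clone_without_data($child,$new_node,$clonedoc);
            }
        }
    }
}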
clone_without_data($doc->documentElement, $empty_xml->documentElement, $empty_xml);
// remove all duplicates
$distinct_structure = new DOMDocument();
$distinct_structure->appendChild($distinct_structure->importNode($doc->documentElement));
foreach ($empty_xml->documentElement->childNodes as $child){
    $match = false;
    foreach ($distinct_structure->documentElement->childNodes as $element){
        if ($distinct_structure->saveXML($element) === $empty_xml->saveXML($child)) {
            $match = true;
            break;
        }
    }
    if (!$match) {
        $distinct_structure->documentElement->appendChild($distinct_structure->importNode($child,true));
    }
}
$distinct_structure->formatOutput = true;
echo $distinct_structure->saveXML();
This results in the unique XML structures stripped of all data:
<?xml version="1.0"?>
<root>
  <node/>
  <node>
    <sub/>
  </node>
</root>
1 Answer
I would consider getting all nodes using getElementsByTagName('*') (note wildcard usage).
I would also consider simply cloning the $doc object.
That could simplify to this:
$doc = new DOMDocument();
$doc->loadXML($xml);
$clone = cloneDOMDocument($doc, true);
function cloneDOMDocument(DOMDocument $source, $remove_values = false) {
    $clone = clone $source;
    if($remove_values === true) {
        // select all nodes in clone
        $all_nodes = $clone->getElementsByTagName('*');
        // iterate all nodes in clone, resetting leaf values to empty string
        foreach($all_nodes as $node) {
            if($node->hasChildNodes() === false) {
                $node->nodeValue = '';
            }
        }
    }
    return $clone;
}
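As a hedged illustration of the cloning idea, here is a minimal sketch that strips values by deleting text nodes instead of resetting nodeValue. A leaf element such as <sub>more</sub> still owns a text child, so hasChildNodes() returns true for it and a leaf-only check never reaches its text. The function names distinctStructure and dedupeChildren are hypothetical, not part of the DOM extension, and the sketch assumes that "data" means text and comment nodes only: attributes are left in place, so siblings that differ only in attribute values are treated as distinct.

```php
<?php
// Hypothetical helper (not from the original post): returns a copy of
// $source with text removed and structurally identical siblings collapsed.
function distinctStructure(DOMDocument $source): DOMDocument
{
    $clone = clone $source; // the source document is left untouched

    // 1. Remove every text and comment node so that only elements remain.
    $xpath = new DOMXPath($clone);
    // Copy the result list into an array before mutating the tree.
    foreach (iterator_to_array($xpath->query('//text() | //comment()')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // 2. Collapse duplicate siblings, deepest level first.
    dedupeChildren($clone->documentElement);

    return $clone;
}

// Hypothetical helper: keeps the first of each group of identical siblings.
function dedupeChildren(DOMElement $parent): void
{
    $seen = [];
    // Iterate over a snapshot, because children are removed inside the loop.
    foreach (iterator_to_array($parent->childNodes) as $child) {
        if (!$child instanceof DOMElement) {
            continue;
        }
        dedupeChildren($child); // normalise the subtree before comparing it
        // Assumption: the serialised subtree is an adequate identity key.
        $key = $child->ownerDocument->saveXML($child);
        if (isset($seen[$key])) {
            $parent->removeChild($child);
        } else {
            $seen[$key] = true;
        }
    }
}
```

Unlike the code in the question, which compares only the direct children of the root, this collapses duplicate siblings at every depth, and the keyed array replaces the inner comparison loop with a lookup. Child order still matters for the comparison, and whether this is faster on large documents has not been measured here.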
if you set $node->nodeValue = ''; then that wipes out all text and sub-nodes too, so after processing, you're left with only <root></root> – Jeff Puckett, Jun 17, 2016 at 15:41
$doc is out of scope for your function... – Jeff Puckett, Jun 17, 2016 at 15:45
@JeffPuckettII Yes. Bad oversight on my part. Made a change in code example to only reset value if node has no children. Also just forgot to change the variable name when moving it into function. That is corrected to reference $source instead of $doc. – Mike Brant, Jun 17, 2016 at 15:45
Thanks for the suggestion Mike, but as it is, this doesn't work. Notice: Undefined property: DOMElement::$hasChildNodes and even if you change that to if($node->hasChildNodes(), then it still has all the text in the nodes you can see from echo $clone->saveXML(); – Jeff Puckett, Jun 17, 2016 at 16:09
$xml = '...