I need to do some text-processing tasks on short XML fragments. The "Details" section shows an example. My solution is a tokenizer based on regular expressions, but it is not elegant and does not use any built-in function. The candidate built-in functions (suggested to me by others) are strtok and SimpleXML.
So, my question has two parts:
Are my assumptions correct? Are there no other candidates besides strtok and SimpleXML? Is it correct to think of DomDocument as an "elephant" (with big CPU overhead) for a simple text-processing task?
How can SimpleXML be used to do the same (illustrated) task? PS: I do not need the whole algorithm or implementation, only some clues.
Details
PHP offers a very simple tokenizer, strtok, but I do not see how to use it with an XML string. The other option, SimpleXML, is perhaps heavy for "text processing" tasks (see this example and the one below), and is more than a tokenizer.
What exactly do I mean by "tokenizer" and "text processing"? See the example below. I used a "regex parser", but I would like an algorithm based on some built-in function like SimpleXML, if it is simpler and faster.
$xmlFrag = '
<p align="center">&nbsp;Hello world!</p>
<p class="test"><i>&nbsp;Beautiful</i> day today.</p>';
// TOKENIZING TAGS AND ENTITIES:
$reg = array(); // token registry: index => original comment, tag, or entity
$xmlFrag = preg_replace_callback(
    '/<!--.+?-->|<.+?>|&[a-z0-9]+;/is',
    function ($m) use (&$reg) {
        $reg[] = $m[0];
        $n = count($reg) - 1;
        return "##$n#";
    },
    $xmlFrag
);
echo $xmlFrag; // results (the line breaks of the input are kept):
// ##0###1#Hello world!##2#
// ##3###4###5#Beautiful##6# day today.##7#
// PROCESS THE TEXT: any, in one step. Example: lower, upper, change orthography, etc.
$xmlFrag = strtoupper($xmlFrag);
echo $xmlFrag;
// ##0###1#HELLO WORLD!##2#
// ##3###4###5#BEAUTIFUL##6# DAY TODAY.##7#
// EXPAND TOKENS:
$xmlFrag = preg_replace_callback(
    '/##([0-9]+)#/',
    function ($m) use ($reg) { return $reg[$m[1]]; },
    $xmlFrag
);
echo $xmlFrag;
// <p align="center">&nbsp;HELLO WORLD!</p>
// <p class="test"><i>&nbsp;BEAUTIFUL</i> DAY TODAY.</p>
Using SimpleXML
How can a SimpleXML algorithm be implemented to solve the illustrated problem (code above)? PROBLEMS:
- Loading XML with named entities (such as &nbsp;, as in the example).
- Traversing the XML to get only the text nodes. With $sx->xpath('//text()'); I cannot edit the nodes.
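One possible workaround for the first problem, as a minimal sketch: XML itself only predefines five named entities, so a name like &nbsp; has to be turned into a numeric character reference before the parser sees it. The $entityMap table (covering only &nbsp;) and the temporary <root> wrapper are assumptions made for illustration, not part of the code above.

```php
<?php
// A fragment using an HTML-style named entity that plain XML does not define.
$xmlFrag = '<p align="center">&nbsp;Hello world!</p>';

// Hypothetical, deliberately tiny map: named entity => numeric character reference.
$entityMap = ['&nbsp;' => '&#160;'];

// Wrap in a temporary root so that fragments with several top-level
// elements also form a well-formed document.
$sx = simplexml_load_string('<root>' . strtr($xmlFrag, $entityMap) . '</root>');
if ($sx === false) {
    throw new RuntimeException('The fragment is not well-formed XML.');
}

// The parser has resolved &#160; to the UTF-8 no-break space character.
$text  = (string) $sx->p;
$align = (string) $sx->p['align'];
```

A real application would need a fuller entity table, or a DOCTYPE that declares the entities it expects.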
Using DomDocument
It is out of scope, because my XML fragments are short and DomDocument imposes a big CPU overhead (or is this a preconception?) for simple text processing.
-
It's possible you have a design or conceptual programming problem here, but it's difficult to understand what you're asking. Please edit your question to be more clear about what aspects you're having issues with. – user53019, Jul 12, 2013 at 3:26
-
I edited it, but I can add: "PHP simplest XML tokenizer for string processing? Using regex loops, SimpleXML, or another solution?" – Peter Krauss, Jul 12, 2013 at 3:29
-
Why not just use SimpleXML? – Robert Harvey, Jul 12, 2013 at 4:30
-
I edited it; see the problems I run into when I try to use it. – Peter Krauss, Jul 12, 2013 at 11:35
1 Answer
Sorry, I found it (!), and I had a misconception about DomDocument (it is not an "elephant"):
DomDocument has performance (execution times) comparable to SimpleXML.
There is a simple algorithm to solve my problem!
$dom = new DOMDocument;
$dom->loadXML($xmlFrag);
$xpath = new DOMXPath($dom);
$elements = $xpath->query("//text()");
foreach ($elements as $element) // loop for text processing:
    $element->nodeValue = strtoupper($element->nodeValue);
print $dom->saveXML();
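The snippet above assumes that $xmlFrag is a well-formed document with a single root element. For a fragment with several top-level elements, like the one in the question, one option is to wrap it in a temporary root and serialize only the children afterwards. The following is a sketch under that assumption; the helper name mapTextNodes and the <root> wrapper are hypothetical, and named entities such as &nbsp; would still have to be handled before loading.

```php
<?php
/**
 * Hypothetical helper: applies $fn to every text node of an XML fragment
 * and returns the fragment with tags and attributes left untouched.
 */
function mapTextNodes(string $xmlFrag, callable $fn): string
{
    $dom = new DOMDocument();
    // Temporary root element, so several top-level elements are allowed.
    if (!$dom->loadXML('<root>' . $xmlFrag . '</root>')) {
        throw new RuntimeException('The fragment is not well-formed XML.');
    }

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//text()') as $textNode) {
        $textNode->nodeValue = $fn($textNode->nodeValue);
    }

    // Serialize the children only, dropping the temporary root again.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveXML($child);
    }
    return $out;
}

$frag = '<p align="center">Hello world!</p><p class="test"><i>Beautiful</i> day today.</p>';
$result = mapTextNodes($frag, 'strtoupper');
echo $result;
```

Any callable that maps a string to a string can be passed in place of strtoupper, which keeps the "process the text in one step" idea of the regex version.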
Performance (execution times for a 10000 loop):
- DOMDocument (implementation above): 0.481 seconds;
- regex tokenizer (implementation from the question): 0.571 seconds;
- SimpleXML (no implementation, only simulating initialization and processing): 0.481 seconds.
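The benchmark script itself is not shown, so the following is only a sketch of how such a 10000-iteration timing could be taken. The helper name timeLoop and the sample document are assumptions, and the numbers it prints will differ from those listed above.

```php
<?php
/**
 * Hypothetical timing helper: runs $task $iterations times and
 * returns the elapsed wall-clock time in seconds.
 */
function timeLoop(callable $task, int $iterations = 10000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $task();
    }
    return microtime(true) - $start;
}

$doc = '<root><p align="center">Hello world!</p></root>';

// Time only the parse step with DOMDocument.
$seconds = timeLoop(function () use ($doc) {
    $dom = new DOMDocument();
    $dom->loadXML($doc);
});

printf("DOMDocument: %.3f seconds\n", $seconds);
```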
@IMSoP shows that "traverse and edit" is more complex with SimpleXML than DOM.
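As a rough illustration of mixing the two APIs, a document loaded with SimpleXML can be handed to DOM with dom_import_simplexml, edited through DOM text nodes, and then read back through the same SimpleXML object. This is a minimal sketch; the sample document and variable names are made up for the example.

```php
<?php
$sx = simplexml_load_string('<root><p><i>Beautiful</i> day today.</p></root>');

// Re-wrap the same parsed tree as a DOMElement (no re-parsing involved).
$domRoot = dom_import_simplexml($sx);

// Edit the text nodes through the DOM API.
$xpath = new DOMXPath($domRoot->ownerDocument);
foreach ($xpath->query('.//text()', $domRoot) as $textNode) {
    $textNode->nodeValue = strtoupper($textNode->nodeValue);
}

// The SimpleXML object sees the change, because both wrap one tree.
$italic = (string) $sx->p->i;
$para   = $sx->p->asXML();
echo $para;
```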
-
SimpleXML and DOM will have nearly identical performance, since they are actually both wrappers around the same parser (libxml2), and can even be used interchangeably with simplexml_import_dom and dom_import_simplexml, which re-wrap the parsed document representation. Using manual tokenization or regular expressions rather than a dedicated parser for XML is pretty much always a bad idea, as it is a complex syntax, and there will be cases you don't account for. – IMSoP, Jul 15, 2013 at 15:30
-
Thanks, your explanation complements the benchmarks and our understanding of them. As for a SimpleXML implementation of this algorithm, you showed that it is more complex than with DOM (thanks for that, too!): it needs a recursive function, etc. So, DOM really is the better solution. – Peter Krauss, Jul 15, 2013 at 20:27