PHP Simplest XML tokenizer for string processing?

Question 1

I need to do some text processing tasks with short XML fragments. The "Details" section shows an example. My solution is a tokenizer based on regular expressions, but it is not elegant and does not use any built-in function. The candidate built-in functions (suggested to me by others) are strtok and SimpleXML.

So, my question has two parts:

Are my assumptions correct? Are there no other "candidates", only strtok and SimpleXML? Is it correct to think of DomDocument as an "elephant" (with big CPU overhead) for a simple text processing task?
How can SimpleXML be used to do the same (illustrated) task? PS: I do not need a full algorithm or implementation, only some clues.

Details

PHP offers a very simple tokenizer, strtok, but I do not see how to use it with an XML string. The other option, SimpleXML, is perhaps heavy for tasks like "text processing" (see this example and the one below), and is more than a tokenizer.
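
As a rough illustration of why strtok is a poor fit here, the sketch below splits a small fragment on the characters < and >. The sample string is an assumption made up for this example. The resulting tokens mix tag names and text with nothing to tell them apart:

```php
<?php
// Split a tiny, made-up XML fragment on "<" and ">" with strtok().
$tokens = [];
$tok = strtok('<p>Hello <i>world</i></p>', '<>');
while ($tok !== false) {
    $tokens[] = $tok;
    $tok = strtok('<>'); // later calls take only the delimiter list
}
print_r($tokens);
// Tokens: "p", "Hello ", "i", "world", "/i", "/p"
```

Because strtok discards the delimiters, the output no longer records which tokens were markup and which were text, so the fragment cannot be rebuilt from them.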

What exactly do I mean by "tokenizer" and "text processing"? See the example below. I used a "regex parser", but I would like an algorithm based on some built-in function like SimpleXML, if it is simpler and faster.

 $xmlFrag = '
 <p align="center">&nbsp; Hello world!</p> 
 <p class="test"><i>&nbsp; Beautiful</i> day today.</p>';
 
 // TOKENIZING TAGS AND ENTITIES:
 // each comment, tag or entity is stored in $reg and replaced by a ##n# placeholder
 $reg = array();
 $xmlFrag = preg_replace_callback(
     '/<!\-\-.+?\-\->|<.+?>|&[a-z0-9]+;/is',
     function ($m) use (&$reg) { // capture $reg by reference instead of using a global
         $reg[] = $m[0];
         $n = count($reg) - 1;
         return "##$n#";
     },
     $xmlFrag
 );
 echo $xmlFrag; // results:
 // ##0###1# Hello world!##2# ##3###4###5# Beautiful##6# day today.##7#
 // PROCESS THE TEXT: any, in one step. Example: lower, upper, change orthography, etc.
 $xmlFrag = strtoupper($xmlFrag);
 echo $xmlFrag;
 // ##0###1# HELLO WORLD!##2# ##3###4###5# BEAUTIFUL##6# DAY TODAY.##7#
 
 // EXPAND TOKENS:
 // each ##n# placeholder is replaced by the original markup stored in $reg
 $xmlFrag = preg_replace_callback(
     '/##([0-9]+)#/is',
     function ($m) use (&$reg) { return $reg[$m[1]]; },
     $xmlFrag
 );
 echo $xmlFrag;
 // <p align="center">&nbsp; HELLO WORLD!</p> 
 // <p class="test"><i>&nbsp; BEAUTIFUL</i> DAY TODAY.</p>

Using SimpleXML

How can a SimpleXML algorithm be implemented to solve the illustrated problem (code above)? PROBLEMS:

Load XML with named entities (such as &nbsp; in the example).
Traverse the XML to get only the text nodes. With $sx->xpath('//text()'); I cannot edit the nodes.

Using DomDocument

It is out of scope, because my XML fragments are short and DomDocument imposes a big CPU overhead (or is this a preconception?) for simple text processing.

Question 2

It's possible you have a design or conceptual programming problem here, but it's difficult to understand what you're asking. Please edit your question to be more clear about what aspects you're having issues with.

Question 3

I edited the question, and I can also add to the title: "PHP Simplest XML tokenizer for string processing? using regex loops or SimpleXML, or another solution?"

Question 4

Why not just use SimpleXML?

Question 5

I edited the question; see the problems I ran into when trying to use it.

score 2 · Answer 1 · 2013-07-12 19:21:24Z

Sorry, I found it (!): I had a misconception about DomDocument (it is not an "elephant"):

DomDocument has performance (execution times) comparable to SimpleXML.

There is a simple algorithm that solves my problem!

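 // Note: loadXML() needs well-formed XML: a single root element and no
 // undeclared named entities such as &nbsp;.
 $dom = new DOMDocument;
 $dom->loadXML($xmlFrag);
 $xpath = new DOMXPath($dom);
 $elements = $xpath->query("//text()"); // every text node in the document
 foreach ($elements as $element) { // loop for text processing:
     $element->nodeValue = strtoupper($element->nodeValue);
 }
 print $dom->saveXML();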
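
The fragment from the question has two top-level elements and uses &nbsp;, so it is not a well-formed XML document on its own. One possible way to feed it to the algorithm above is sketched below. It assumes a hypothetical wrapper element named "root" and an internal DOCTYPE that declares only the nbsp entity. Both are illustration choices, not part of the original code:

```php
<?php
$xmlFrag = '
<p align="center">&nbsp; Hello world!</p>
<p class="test"><i>&nbsp; Beautiful</i> day today.</p>';

// Assumption: "root" is a hypothetical wrapper name; only &nbsp; is declared.
$wrapped = '<!DOCTYPE root [<!ENTITY nbsp "&#160;">]>'
         . '<root>' . $xmlFrag . '</root>';

$dom = new DOMDocument;
// LIBXML_NOENT makes the parser substitute the declared entity with its value.
$dom->loadXML($wrapped, LIBXML_NOENT);

$xpath = new DOMXPath($dom);
foreach ($xpath->query('//text()') as $textNode) {
    // strtoupper() suits ASCII text; other text may need a multibyte-aware function.
    $textNode->nodeValue = strtoupper($textNode->nodeValue);
}

// Serialize only the children of the wrapper to get a fragment back.
$out = '';
foreach ($dom->documentElement->childNodes as $child) {
    $out .= $dom->saveXML($child);
}
echo $out;
```

Tags and attributes are left untouched because the XPath expression selects text nodes only. The no-break space is no longer written as &nbsp; in the output, since the parser replaced the entity with the character it stands for.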

Performance (execution times for a loop of 10000 iterations):

DOMDocument (implementation above): 0.481 seconds;
regex tokenizer (implementation from the question): 0.571 seconds;
SimpleXML (no implementation, only simulating initializations and processing): 0.481 seconds.

@IMSoP shows that "traverse and edit" is more complex with SimpleXML than with DOM.
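
The benchmark script itself is not shown, so the following is only a sketch of how such a comparison might be timed. The helper timeIt() and the single-root sample string are assumptions introduced for this example, and the numbers it prints will differ from the ones above depending on machine and PHP version:

```php
<?php
// Hypothetical helper: run $task $iterations times and return elapsed seconds.
function timeIt(callable $task, int $iterations = 10000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $task();
    }
    return microtime(true) - $start;
}

// Assumed sample input: well-formed, single root, no named entities.
$xml = '<root><p align="center">Hello world!</p>'
     . '<p class="test"><i>Beautiful</i> day today.</p></root>';

$domTime = timeIt(function () use ($xml) {
    $dom = new DOMDocument;
    $dom->loadXML($xml);
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//text()') as $node) {
        $node->nodeValue = strtoupper($node->nodeValue);
    }
    $dom->saveXML();
});

$regexTime = timeIt(function () use ($xml) {
    $reg = [];
    // Tokenize tags and entities into ##n# placeholders.
    $s = preg_replace_callback('/<!--.+?-->|<.+?>|&[a-z0-9]+;/is',
        function ($m) use (&$reg) {
            $reg[] = $m[0];
            return '##' . (count($reg) - 1) . '#';
        }, $xml);
    $s = strtoupper($s);
    // Expand the placeholders back into the stored markup.
    preg_replace_callback('/##([0-9]+)#/',
        function ($m) use ($reg) {
            return $reg[(int) $m[1]];
        }, $s);
});

printf("DOMDocument: %.3f s\nregex tokenizer: %.3f s\n", $domTime, $regexTime);
```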

SimpleXML and DOM will have nearly identical performance, since they are actually both wrappers around the same parser (libxml2), and can even be used interchangeably with simplexml_import_dom and dom_import_simplexml, which re-wrap the parsed document representation. Using manual tokenization or regular expressions rather than a dedicated parser for XML is pretty much always a bad idea, as it is a complex syntax, and there will be cases you don't account for.

Thanks, your explanation complements the benchmarks and our understanding of them. Regarding a SimpleXML implementation of this algorithm, you showed that it is more complex than with DOM (thanks for that too!): it needs a recursive function, etc. So DOM really is the better solution.
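
The interchangeability mentioned above can be sketched as follows: the document is loaded with SimpleXML, the text nodes are edited through the DOM view obtained from dom_import_simplexml, and the result is read back through the SimpleXML object. The sample string is an assumption (well-formed, single root, no named entities):

```php
<?php
// Assumed sample input for illustration only.
$xml = '<root><p align="center">Hello world!</p>'
     . '<p class="test"><i>Beautiful</i> day today.</p></root>';

$sx = simplexml_load_string($xml);

// Re-wrap the already parsed tree as a DOMElement (no second parse).
$domRoot = dom_import_simplexml($sx);

$xpath = new DOMXPath($domRoot->ownerDocument);
foreach ($xpath->query('//text()') as $textNode) {
    $textNode->nodeValue = strtoupper($textNode->nodeValue);
}

// Both wrappers point at the same tree, so SimpleXML sees the edited text.
$result = $sx->asXML();
echo $result;
```

This keeps SimpleXML for convenient reading while leaving the "traverse and edit text nodes" step to DOM, where it is a plain loop.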
