Illegal character in Xml

Question 1

I have a PHP file which produces an XML sitemap based on data which has been imported from a number of sources. My sitemap is currently not well-formed due to an illegal character in one line of the imported data; however, I am struggling to remove it.

The character appears to represent the 'squared' sign (superscript 2), and is displayed as a square. I have tried pasting it into a hex editor, but it is shown as a ?, and the hex code also corresponds to ?. I have also tried using iconv to convert from every source encoding to every destination encoding, but no combination removes this character.

I also have the following function to remove non-ASCII characters:

function stripInvalidXml($value)
{
 $ret = "";
 $current;
 if (empty($value)) 
 {
 return $ret;
 }
 $length = strlen($value);
 for ($i=0; $i < $length; $i++)
 {
 $current = ord($value{$i});
 if (($current == 0x9) ||
 ($current == 0xA) ||
 ($current == 0xD) ||
 (($current >= 0x20) && ($current <= 0xD7FF)) ||
 (($current >= 0xE000) && ($current <= 0xFFFD)) ||
 (($current >= 0x10000) && ($current <= 0x10FFFF)))
 {
 if($current != 0x1F)
 {
 $ret .= chr($current);
 }
 }
 else
 {
 $ret .= " ";
 }
 }
 return $ret;
}

However, this still does not remove it. If I step through the code, the illegal character is shown expanded in Eclipse's debug window. The string it is having issues with is below (hoping it pastes correctly):

251gm-50

Any ideas for a function which will remove this character and prevent this from occurring are much appreciated. I have little control over the data that is imported, so it needs to be done at the point of XML generation.

EDIT

After posting, I can see that the character doesn't appear correctly. When viewed in Eclipse's debug window it appears as & # 65535 ; (written here with spaces, because without them it is rendered as the character itself).

Question 2

You are trying to perform character transcoding. Don't do it yourself; use the PHP library.

I found iconv quite useful:

$cleanText = iconv('UTF-8','ISO-8859-1//TRANSLIT//IGNORE', $srcText);

This code converts from UTF-8 to ISO-8859-1, trying to remap the 'exotic' characters and ignoring the ones that cannot be transcoded.

I'm just guessing that the source encoding is UTF-8. You have to discover which encoding the incoming data uses and convert it to the one you declare in the XML header.

enca is a Linux command-line tool that guesses a file's encoding.
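As a rough illustration of that advice, the sketch below checks whether a string is already valid UTF-8 and otherwise converts it with iconv. The helper name toUtf8 and the ISO-8859-1 fallback are assumptions made for the example; the real source encoding still has to be established from the imported data.

```php
<?php
// Hypothetical helper: returns $text as UTF-8, assuming that any input which
// is not already valid UTF-8 is encoded as $fallback.
function toUtf8(string $text, string $fallback = 'ISO-8859-1'): string
{
    // With the /u modifier, preg_match() returns false on malformed UTF-8
    // and 1 when the (empty) pattern matches a valid UTF-8 subject.
    if (preg_match('//u', $text) === 1) {
        return $text;
    }

    $converted = iconv($fallback, 'UTF-8', $text);
    if ($converted === false) {
        throw new RuntimeException("Could not convert text from $fallback");
    }
    return $converted;
}

$latin1 = "caf\xE9";                 // "café" in ISO-8859-1; not valid UTF-8
echo bin2hex(toUtf8($latin1)), "\n"; // prints 636166c3a9
```

The output of this helper can then be written into a document whose XML header declares UTF-8.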

Question 3

I tried iconv with every combination of input and output encoding, and none of them worked.

Question 4

I changed the encoding from UTF-8 to ISO-8859-1 and it resolved my 4f's in a box issue.

Question 5

This is wrong:

 $current = ord($value{$i});
 if (($current == 0x9) ||
 ($current == 0xA) ||
 ($current == 0xD) ||
 (($current >= 0x20) && ($current <= 0xD7FF)) ||
 (($current >= 0xE000) && ($current <= 0xFFFD)) ||
 (($current >= 0x10000) && ($current <= 0x10FFFF)))
 {
 if($current != 0x1F)
 $ret .= chr($current);
 }

ord() never returns anything bigger than 0xFF since it works in a byte-by-byte manner, so the comparisons against ranges above 0xFF can never match.
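A code-point-aware version of the same check could look like the sketch below. It assumes the input is already valid UTF-8 and uses preg_replace with the /u modifier, so each character is compared as a whole code point rather than byte by byte. The function name stripInvalidXmlChars is made up for this example; the ranges are the ones from the function in the question.

```php
<?php
// Sketch: replace every code point outside the ranges used in the question
// with $replacement. Assumes $value is valid UTF-8.
function stripInvalidXmlChars(string $value, string $replacement = ' '): string
{
    $result = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        $replacement,
        $value
    );

    // preg_replace() returns null on error, e.g. when $value is not valid UTF-8.
    if ($result === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $result;
}

// Hypothetical input containing U+FFFF between "gm" and "-".
echo stripInvalidXmlChars("251gm\u{FFFF}-50"), "\n"; // prints "251gm -50"
```

This only helps when the offending character is present as raw bytes. If the data contains the literal text of a character reference instead, the string has to be handled at that level.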

I'm guessing your XML is invalid because the file contains an invalid character (indeed &#65535;, i.e., U+FFFF, is a Unicode noncharacter that XML 1.0 does not permit). This probably comes from copying and pasting between XML files that have different encodings.

I suggest you use the DOM extension instead to do your XML mash-up, which handles different encodings automatically by converting them internally to UTF-8.
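A minimal sketch of what building the sitemap with DOM might look like is shown below. The $pages array and its URLs are invented for the example; the element names and namespace are those of the sitemaps.org protocol.

```php
<?php
// Hypothetical rows; in the real script these come from the imported data.
$pages = [
    ['loc' => 'http://example.com/item?id=1&size=2', 'lastmod' => '2010-07-14'],
    ['loc' => 'http://example.com/about',            'lastmod' => '2010-07-01'],
];

$ns  = 'http://www.sitemaps.org/schemas/sitemap/0.9';
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->formatOutput = true;

// appendChild() returns the node that was appended.
$urlset = $doc->appendChild($doc->createElementNS($ns, 'urlset'));

foreach ($pages as $page) {
    $url = $urlset->appendChild($doc->createElementNS($ns, 'url'));
    foreach ($page as $name => $value) {
        $child = $url->appendChild($doc->createElementNS($ns, $name));
        // Text nodes are escaped on serialisation, so "&" is written as "&amp;".
        $child->appendChild($doc->createTextNode($value));
    }
}

$sitemap = $doc->saveXML();
echo $sitemap;
```

Because the values are added as text nodes, markup characters in the imported data are escaped by the library rather than by hand-written string code.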

Question 6

Good suggestion. I have inherited some code which generates the XML as a string; DOM would be a far cleaner way of doing this.

Question 7

DOM may be overkill for producing something like an RSS feed: he probably doesn't need all the manipulation and search facilities, and for big documents the memory footprint of a DOM structure might be excessive.

Question 8

@Iacopo Overkill? In what regard? For manipulating XML, DOM is the best lib PHP has. If memory is an issue, there is XMLWriter. In both cases, the result is more reliable than using string concatenation or reinventing everything those libs already do on their own.
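For comparison, a streaming version with XMLWriter might look like the sketch below. The $locations array is invented for the example, and the output is buffered in memory only to keep the snippet short.

```php
<?php
// Hypothetical URLs; in the real script these come from the imported data.
$locations = [
    'http://example.com/item?id=1&size=2',
    'http://example.com/about',
];

$writer = new XMLWriter();
$writer->openMemory();          // buffer the output in memory
$writer->setIndent(true);
$writer->startDocument('1.0', 'UTF-8');

$writer->startElement('urlset');
$writer->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');

foreach ($locations as $loc) {
    $writer->startElement('url');
    // writeElement() escapes the text content, so "&" becomes "&amp;".
    $writer->writeElement('loc', $loc);
    $writer->endElement();      // </url>
}

$writer->endElement();          // </urlset>
$writer->endDocument();

$sitemapXml = $writer->outputMemory();
echo $sitemapXml;
```

Elements are written as they are produced, so no tree of the whole document has to be held in memory the way it is with DOM.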

Question 9

I think I was looking down the wrong path: rather than an encoding issue, the character was an HTML entity representing the 'squared' symbol. As the descriptions in the URL only exist for search engine purposes, I can safely remove all HTML entities with the following regex:

$content = preg_replace("/&#?[a-z0-9]+;/i","",$content);
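Note that this pattern removes every entity, including ones XML needs such as &amp;. If that is too aggressive, a narrower variant is sketched below: it drops only numeric character references whose code point falls outside the ranges checked by the function in the question. The function name stripInvalidCharRefs is made up for the example.

```php
<?php
// Sketch: remove numeric character references (decimal or hex) that point at
// code points outside the ranges allowed for XML characters; keep the rest.
function stripInvalidCharRefs(string $content): string
{
    $result = preg_replace_callback(
        '/&#(?:x([0-9a-f]+)|([0-9]+));/i',
        function (array $m): string {
            // Group 1 is filled for hex references, group 2 for decimal ones.
            $cp = ($m[1] !== '') ? hexdec($m[1]) : (int) $m[2];
            $valid = $cp === 0x9 || $cp === 0xA || $cp === 0xD
                || ($cp >= 0x20 && $cp <= 0xD7FF)
                || ($cp >= 0xE000 && $cp <= 0xFFFD)
                || ($cp >= 0x10000 && $cp <= 0x10FFFF);
            return $valid ? $m[0] : '';
        },
        $content
    );
    // Fall back to the untouched input if the regex engine reports an error.
    return $result ?? $content;
}

echo stripInvalidCharRefs('251gm&#65535;-50 &amp; m&#178;'), "\n";
// prints: 251gm-50 &amp; m&#178;
```

Named entities such as &amp; are not matched by this pattern at all, so they pass through unchanged.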

Accepted Answer · Iacopo · 2010-07-14 12:10:54Z

