3

I have a PHP file that produces an XML sitemap from data imported from a number of sources. The sitemap is currently not well formed because of an illegal character in one line of the imported data, and I am struggling to remove it.

The character appears to represent the 'squared' sign (superscript 2) and is rendered as a square. I have tried pasting it into a hex editor, but it is shown as a ?, and the hex code also corresponds to ?. I have also tried using iconv to convert from every source encoding to every destination encoding, and no combination removes the character.

I also have the following function to remove non-ASCII characters:

function stripInvalidXml($value)
{
    $ret = "";
    $current;
    if (empty($value))
    {
        return $ret;
    }
    $length = strlen($value);
    for ($i=0; $i < $length; $i++)
    {
        $current = ord($value{$i});
        if (($current == 0x9) ||
            ($current == 0xA) ||
            ($current == 0xD) ||
            (($current >= 0x20) && ($current <= 0xD7FF)) ||
            (($current >= 0xE000) && ($current <= 0xFFFD)) ||
            (($current >= 0x10000) && ($current <= 0x10FFFF)))
        {
            if($current != 0x1F)
            {
                $ret .= chr($current);
            }
        }
        else
        {
            $ret .= " ";
        }
    }
    return $ret;
}

However, this still does not remove it. If I step through the code, the illegal character is expanded to ￿ in Eclipse's debug window. The string it has trouble with is below (hoping it pastes correctly):

251gm-50

Any ideas for a function that will remove this character and prevent this from occurring are much appreciated. I have little control over the imported data, so it needs to be done at the point of XML generation.

EDIT

After posting I can see that the character doesn't appear correctly. When viewed in Eclipse's window it appears as & # 65535 ; (without the spaces; written without them it is rendered as the character itself, which looks like ￿).

asked Jul 14, 2010 at 11:59

3 Answers

3

You are trying to perform character transcoding. Don't do it yourself; use the PHP library.

I found iconv quite useful:

$cleanText = iconv('UTF-8','ISO-8859-1//TRANSLIT//IGNORE', $srcText);

This code translates from UTF-8 to ISO-8859-1, trying to remap the 'exotic' characters and ignoring the ones that cannot be transcoded.

I'm just guessing that the source encoding is UTF-8. You have to discover which encoding the incoming data uses and translate it into the one you declare in the XML header.

A Linux command-line tool that guesses a file's encoding is enca.
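The "discover the encoding, then translate" step can be sketched roughly as follows. This is only an illustration: the helper name toUtf8() and the ISO-8859-1 fallback are assumptions, not something taken from the question, and a byte string that fails the UTF-8 check could in reality be in any legacy encoding.

```php
<?php
// Hypothetical helper: returns $text as UTF-8.
// Assumption: anything that is not already valid UTF-8 is in $fallbackEncoding.
function toUtf8(string $text, string $fallbackEncoding = 'ISO-8859-1'): string
{
    // preg_match() with the /u modifier returns 1 only when the subject is valid UTF-8.
    if (preg_match('//u', $text) === 1) {
        return $text;
    }

    // Otherwise transcode from the assumed legacy encoding.
    $converted = iconv($fallbackEncoding, 'UTF-8', $text);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $fallbackEncoding");
    }
    return $converted;
}
```

For example, toUtf8("m\xB2") would turn the single ISO-8859-1 byte for the superscript 2 into its two-byte UTF-8 form, while a string that is already valid UTF-8 passes through unchanged.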

answered Jul 14, 2010 at 12:10

2 Comments

I tried iconv with all combinations of encoding for both input and output, and it didn't work with any of them.
I changed the encoding from UTF-8 to ISO-8859-1 and it resolved my 4f's in a box issue.
2

This is wrong:

    $current = ord($value{$i});
    if (($current == 0x9) ||
        ($current == 0xA) ||
        ($current == 0xD) ||
        (($current >= 0x20) && ($current <= 0xD7FF)) ||
        (($current >= 0xE000) && ($current <= 0xFFFD)) ||
        (($current >= 0x10000) && ($current <= 0x10FFFF)))
    {
        if($current != 0x1F)
            $ret .= chr($current);
    }

ord() never returns anything bigger than 0xFF since it works in a byte-by-byte manner.
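One way to make the same range check operate on code points instead of bytes is a regular expression with the /u modifier. The following is a minimal sketch, assuming the input is already valid UTF-8; the function name stripInvalidXmlChars() is made up for illustration, and the ranges are the ones from the question's own if statement.

```php
<?php
// Hypothetical helper: replaces every code point outside the ranges
// listed in the question with a space. Assumes $value is valid UTF-8.
function stripInvalidXmlChars(string $value): string
{
    // With /u the character class is matched per code point, not per byte.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    $clean = preg_replace($pattern, ' ', $value);
    if ($clean === null) {
        // preg_replace() returns null on error, e.g. when the input is not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}
```

Called on a string that contains U+FFFF between "251gm" and "-50", this replaces that one character with a space and leaves the rest alone.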

I'm guessing your XML is invalid because the file contains a character that XML does not allow (indeed &#65535;, i.e., U+FFFF, is a Unicode noncharacter and lies outside the range XML 1.0 permits, which stops at 0xFFFD). This probably comes from copy-pasting different XML files that have different encodings.

I suggest you use the DOM extension instead to do your XML mash-up, which handles different encodings automatically by converting them internally to UTF-8.
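A rough sketch of what building the sitemap through DOM could look like is shown here. The function name buildSitemap() and the flat list of URLs are assumptions for illustration; the namespace is the one defined by the sitemaps.org protocol, and the strings handed to DOM are assumed to be UTF-8 already.

```php
<?php
// Hypothetical helper: builds a sitemap document from a list of URL strings.
function buildSitemap(array $urls): string
{
    $ns = 'http://www.sitemaps.org/schemas/sitemap/0.9';

    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;

    $urlset = $doc->createElementNS($ns, 'urlset');
    $doc->appendChild($urlset);

    foreach ($urls as $loc) {
        $url = $doc->createElementNS($ns, 'url');
        $locNode = $doc->createElementNS($ns, 'loc');
        // Text nodes are escaped on serialization, so & becomes &amp; in the output.
        $locNode->appendChild($doc->createTextNode($loc));
        $url->appendChild($locNode);
        $urlset->appendChild($url);
    }

    return $doc->saveXML();
}
```

Calling buildSitemap(['http://example.com/?a=1&b=2']) returns the serialized document as a string, with the ampersand in the URL escaped by the library rather than by hand.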

answered Jul 14, 2010 at 12:05

3 Comments

Good suggestion - I have inherited some code which generates the XML as a string; DOM would be a far cleaner way of doing this.
DOM is maybe overkill for producing something like an RSS feed: he probably doesn't need all the manipulation/search facilities, and for big documents the memory footprint of a DOM structure might be excessive.
@lacopo Overkill? In what regard? For manipulating XML, DOM is the best lib PHP has. If memory is an issue, there is XMLWriter. In both cases, the result is more reliable than using string concatenation or reinventing everything those libs do already on their own.
1

I think I was looking down the wrong path: rather than an encoding issue, the character was an HTML entity representing the 'squared' symbol. As the descriptions in the URL only exist for search engine purposes, I can safely remove all HTML entities with the following regex:

$content = preg_replace("/&#?[a-z0-9]+;/i","",$content);
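Note that the regex above also strips entities XML needs, such as &amp;. If only the offending numeric references should go, a narrower variant might look like the sketch below. The function name stripInvalidNumericEntities() is hypothetical, and the allowed ranges are the same ones used in the question's range check.

```php
<?php
// Hypothetical helper: removes numeric character references (&#NNN; or &#xHHH;)
// whose code point falls outside the ranges the question treats as valid,
// and leaves every other entity untouched.
function stripInvalidNumericEntities(string $content): string
{
    $result = preg_replace_callback(
        '/&#(x[0-9a-f]+|[0-9]+);/i',
        function (array $m): string {
            $ref = $m[1];
            // A leading x or X marks a hexadecimal reference.
            $code = ($ref[0] === 'x' || $ref[0] === 'X')
                ? hexdec(substr($ref, 1))
                : (int) $ref;

            $valid = $code == 0x9 || $code == 0xA || $code == 0xD
                || ($code >= 0x20 && $code <= 0xD7FF)
                || ($code >= 0xE000 && $code <= 0xFFFD)
                || ($code >= 0x10000 && $code <= 0x10FFFF);

            // Keep valid references as written; drop the rest.
            return $valid ? $m[0] : '';
        },
        $content
    );

    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $content;
}
```

With this, '251gm&#65535;-50' becomes '251gm-50', while references such as &#178; and named entities such as &amp; are kept.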
answered Jul 14, 2010 at 12:15
