Programming Tutorials


Perl's Encoding::FixLatin equivalent in PHP

By: squeegee in PHP Tutorials on 2011-07-31

I think this is a reasonable port of Perl's Encoding::FixLatin by Grant McLean, which converts a string with mixed encodings (ASCII, ISO-8859-1, CP1252, and UTF-8) to UTF-8.

<?php
function init_byte_map(){
 global $byte_map;
 for($x=128;$x<256;++$x){
 $byte_map[chr($x)]=mb_convert_encoding(chr($x),'UTF-8','ISO-8859-1'); // same result as utf8_encode(), which is deprecated as of PHP 8.2
 }
 $cp1252_map=array(
 "\x80" => "\xE2\x82\xAC", // EURO SIGN
 "\x82" => "\xE2\x80\x9A", // SINGLE LOW-9 QUOTATION MARK
 "\x83" => "\xC6\x92", // LATIN SMALL LETTER F WITH HOOK
 "\x84" => "\xE2\x80\x9E", // DOUBLE LOW-9 QUOTATION MARK
 "\x85" => "\xE2\x80\xA6", // HORIZONTAL ELLIPSIS
 "\x86" => "\xE2\x80\xA0", // DAGGER
 "\x87" => "\xE2\x80\xA1", // DOUBLE DAGGER
 "\x88" => "\xCB\x86", // MODIFIER LETTER CIRCUMFLEX ACCENT
 "\x89" => "\xE2\x80\xB0", // PER MILLE SIGN
 "\x8A" => "\xC5\xA0", // LATIN CAPITAL LETTER S WITH CARON
 "\x8B" => "\xE2\x80\xB9", // SINGLE LEFT-POINTING ANGLE QUOTATION MARK
 "\x8C" => "\xC5\x92", // LATIN CAPITAL LIGATURE OE
 "\x8E" => "\xC5\xBD", // LATIN CAPITAL LETTER Z WITH CARON
 "\x91" => "\xE2\x80\x98", // LEFT SINGLE QUOTATION MARK
 "\x92" => "\xE2\x80\x99", // RIGHT SINGLE QUOTATION MARK
 "\x93" => "\xE2\x80\x9C", // LEFT DOUBLE QUOTATION MARK
 "\x94" => "\xE2\x80\x9D", // RIGHT DOUBLE QUOTATION MARK
 "\x95" => "\xE2\x80\xA2", // BULLET
 "\x96" => "\xE2\x80\x93", // EN DASH
 "\x97" => "\xE2\x80\x94", // EM DASH
 "\x98" => "\xCB\x9C", // SMALL TILDE
 "\x99" => "\xE2\x84\xA2", // TRADE MARK SIGN
 "\x9A" => "\xC5\xA1", // LATIN SMALL LETTER S WITH CARON
 "\x9B" => "\xE2\x80\xBA", // SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
 "\x9C" => "\xC5\x93", // LATIN SMALL LIGATURE OE
 "\x9E" => "\xC5\xBE", // LATIN SMALL LETTER Z WITH CARON
 "\x9F" => "\xC5\xB8" // LATIN CAPITAL LETTER Y WITH DIAERESIS
 );
 foreach($cp1252_map as $k=>$v){
 $byte_map[$k]=$v;
 }
}
function fix_latin($instr){
 if(mb_check_encoding($instr,'UTF-8'))return $instr; // no need for the rest if it's all valid UTF-8 already
 global $nibble_good_chars,$byte_map;
 $outstr='';
 $char='';
 $rest='';
 while((strlen($instr))>0){
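 if(1==preg_match($nibble_good_chars,$instr,$match)){ // leading run of ASCII or one well-formed UTF-8 sequence: keep as-is
 $char=$match[1];
 $rest=$match[2];
 $outstr.=$char;
 }elseif(1==preg_match('@^(.)(.*)$@s',$instr,$match)){ // otherwise take one stray byte and map it to UTF-8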
 $char=$match[1];
 $rest=$match[2];
 $outstr.=$byte_map[$char];
 }
 $instr=$rest;
 }
 return $outstr;
}
$byte_map=array();
init_byte_map();
$ascii_char='[\x00-\x7F]';
$cont_byte='[\x80-\xBF]';
$utf8_2='[\xC0-\xDF]'.$cont_byte;
$utf8_3='[\xE0-\xEF]'.$cont_byte.'{2}';
$utf8_4='[\xF0-\xF7]'.$cont_byte.'{3}';
$utf8_5='[\xF8-\xFB]'.$cont_byte.'{4}';
$nibble_good_chars = "@^($ascii_char+|$utf8_2|$utf8_3|$utf8_4|$utf8_5)(.*)$@s";
?>

Then just call fix_latin wherever you need it.
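
For a sense of the input this is meant for, here is a minimal sketch. The sample string and variable names are hypothetical and not part of the original port. It builds a string that mixes a Latin-1 byte with a UTF-8 sequence, then shows that it is not valid UTF-8 and that a blanket Latin-1 conversion double-encodes the part that was already UTF-8.

```php
<?php
// Hypothetical sample: "café" with a Latin-1 byte next to "naïve" already in UTF-8.
$latin1_part = "caf\xE9";       // é as the single byte 0xE9 (ISO-8859-1 / CP1252)
$utf8_part   = "na\xC3\xAFve";  // ï as the UTF-8 pair 0xC3 0xAF
$mixed       = $latin1_part . ' ' . $utf8_part;

// The stray 0xE9 makes the whole string invalid as UTF-8.
$is_valid = mb_check_encoding($mixed, 'UTF-8');

// Treating every byte as Latin-1 also re-encodes the bytes that were already UTF-8.
$blanket = mb_convert_encoding($mixed, 'UTF-8', 'ISO-8859-1');

// What a per-sequence fixer such as fix_latin is intended to produce.
$expected = "caf\xC3\xA9 na\xC3\xAFve";

var_dump($is_valid);               // bool(false)
var_dump($blanket === $expected);  // bool(false)
```

With the listing above loaded, fix_latin($mixed) is intended to return the same bytes as $expected, because it passes well-formed UTF-8 sequences through untouched and maps only the stray bytes.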



