Fixing broken UTF-8 encoding

Question 1

I am in the process of fixing some bad UTF-8 encoding. I am currently using PHP 5 and MySQL.

In my database I have a few instances of bad encodings that print like: ÃƒÂ®

- The database collation is utf8_general_ci.
- PHP sends a proper UTF-8 header.
- Notepad++ is set to use UTF-8 without BOM.
- Database management is handled in phpMyAdmin.
- Not all accented characters are broken.

I need some sort of function that will help me map the instances of ÃƒÂ®, ÃƒÂ, ÃƒÂ¼ and others like them to their proper accented UTF-8 characters.

Question 2

Perhaps you could list the characters those are supposed to represent? And maybe a hex dump?

Question 3

A quick look seems to suggest that your strings might have been "double" utf-8 encoded. I.e. encoded in utf-8, those bytes taken as unicode characters, and the result encoded in utf-8. Going backwards: "ÃƒÂ®"="\xC3\x83\xC2\xAE" <-(utf-8)- "\xC3\xAE" <-(utf-8)- "\xEE" = "î". Or perhaps not -- not much data to diagnose here.
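The chain described in this comment can be reproduced byte by byte. Here is a minimal sketch in Python (used only for illustration; the thread itself is about PHP). The helper name double_encode is hypothetical, and it assumes Latin-1 was the wrongly guessed encoding in the middle step.

```python
def double_encode(text: str) -> bytes:
    """Encode to UTF-8, misread those bytes as Latin-1, then encode to UTF-8 again."""
    once = text.encode("utf-8")       # "î" becomes b"\xc3\xae"
    misread = once.decode("latin-1")  # each byte is taken as one character: "Ã®"
    return misread.encode("utf-8")    # the doubly encoded bytes


garbled = double_encode("î")
print(garbled.hex(" "))  # hex dump of the stored bytes: c3 83 c2 ae

# Going backwards removes one layer of encoding.
restored = garbled.decode("utf-8").encode("latin-1").decode("utf-8")
print(restored)  # î
```

A hex dump like this, taken from the actual column value, is the quickest way to confirm whether a row really is double encoded.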

Question 4

It is possible that it was double encoded. Is there a safe way to check this programmatically, and if so, what is the best way to safely decode the double encoding?
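One conservative way to approach this, sketched in Python rather than PHP: try to reverse a single layer and keep the original whenever the reversal is not a clean round trip. The function names are hypothetical, and the sketch assumes the extra layer went through Latin-1. A string that legitimately contains a sequence such as "Ã®" would be changed too, so this is a heuristic, not a proof.

```python
def undo_one_layer(text: str) -> str:
    """Remove one layer of double encoding, or return text unchanged if that fails."""
    try:
        # Map each character back to the byte it came from, then read as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable as Latin-1, or the bytes are not valid UTF-8:
        # the string was not double encoded in this way, so leave it alone.
        return text


def looks_double_encoded(text: str) -> bool:
    """True when reversing one layer succeeds and actually changes the string."""
    return undo_one_layer(text) != text


print(looks_double_encoded("\u00c3\u00ae"))  # "Ã®" -> True
print(looks_double_encoded("î"))             # already correct -> False
```

Correctly encoded accented text fails the UTF-8 decode step and is returned untouched, which is what makes the check reasonably safe to run over a column with mixed good and bad rows.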

Question 5

Yes, Jayrox, check out my answer below.

Question 6

One of the problems, as far as I know, is utf8_general_ci, which apparently will not guarantee good UTF-8: stackoverflow.com/a/1036459/183677. Also, the characters you mention are valid UTF-8 (hexutf8.com/…), though I realize that is probably just what you're seeing in the console or wherever. It pays to post the actual bytes.

Question 7

If you have double-encoded UTF-8 characters (various smart quotes, dashes, apostrophe â€™, quotation mark â€œ, etc.), in MySQL you can dump the data, then read it back in to fix the broken encoding.

Like this:

# -p on its own prompts for the password; a value after a space would be read as the database name
mysqldump -h DB_HOST -u DB_USER -p --opt --quote-names \
 --skip-set-charset --default-character-set=latin1 DB_NAME > DB_NAME-dump.sql
mysql -h DB_HOST -u DB_USER -p \
 --default-character-set=utf8 DB_NAME < DB_NAME-dump.sql

This was a 100% fix for my double encoded UTF-8.

Source: http://blog.hno3.org/2010/04/22/fixing-double-encoded-utf-8-data-in-mysql/

Question 8

Seems to have successfully converted a Typo3 database for me. Thanks for posting this; it's much cleaner than any other conversion method. :)

Question 9

I wish I could give you more upvotes, you really really deserve them.

Question 10

Yep, also worked for me! Thanks to you sharing it here and thanks to the owner of the blog :)

Question 11

Ran into the problem when transferring a Wordpress DB from staging to local environment by exporting it with Sequel Pro.

Question 12

Works perfectly! I also had to fix an old TYPO3 database and this just did the trick!

Question 13

If you call utf8_encode() on a string that is already UTF-8, the result looks garbled, and each additional pass garbles it further.

I made a function toUTF8() that converts strings into UTF-8.

You don't need to specify what the encoding of your strings is. It can be Latin1 (iso 8859-1), Windows-1252 or UTF8, or a mix of these three.

I used this myself on a feed with mixed encodings in the same string.

Usage:

$utf8_string = Encoding::toUTF8($mixed_string);
$latin1_string = Encoding::toLatin1($mixed_string);

My other function fixUTF8() fixes garbled UTF8 strings if they were encoded into UTF8 multiple times.

Usage:

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂÃÂ©dÃÂÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
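For strings that were encoded more than once as a whole, the idea can be sketched in a few lines of Python. This is not the algorithm used by the forceutf8 class; it is a simplified illustration with a hypothetical function name, it assumes every extra layer went through Latin-1, and unlike fixUTF8() it does not repair strings where only some characters are garbled.

```python
def undo_all_layers(text: str, max_layers: int = 5) -> str:
    """Peel off repeated UTF-8-over-Latin-1 layers until no further layer can be removed."""
    for _ in range(max_layers):
        try:
            candidate = text.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # no further layer can be removed cleanly
        if candidate == text:
            break  # pure ASCII or nothing left to undo
        text = candidate
    return text


# Build a string that has been encoded three times too many, then repair it.
garbled = "Fédération Camerounaise de Football"
for _ in range(3):
    garbled = garbled.encode("utf-8").decode("latin-1")

print(undo_all_layers(garbled))
```

The max_layers cap is only a safety net; the loop normally stops by itself as soon as the text is no longer valid UTF-8 when viewed as Latin-1 bytes.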

Download:

https://github.com/neitanod/forceutf8

Question 14

Seems to do the trick. I don't use it for normal output, but I do enjoy using your class for data migration help.

Question 15

Thanks. It's magical, isn't it? I think this little piece of code is one of the most satisfying things I've produced, in terms of problems solved with it. :-)

Question 16

I recommend using it for migrations, as Kristopher said, but not in a production environment. There are cases where you would want the "garbled string" to stay garbled, like in this answer.

Question 17

I have struggled with third party systems that have mixed encoding. I tested your class out, and it works well. I just ran it on fields in our database that stored outside input with mixed encoding, and it cleaned everything up. Now I am implementing it at our insert junctions. PDO doesn't identify mixed encoding by the way, thus your solution rocks!

Question 18

+1 excellent- fixUTF8 even takes care of some weird encoding errors I've seen.

Question 19

I've had to try to 'fix' a number of broken UTF-8 situations in the past, and unfortunately it's never easy, and often close to impossible.

Unless you can determine exactly how it was broken, and it was always broken in that exact same way, then it's going to be hard to 'undo' the damage.

If you want to try to undo the damage, your best bet would be to start writing some sample code, where you attempt numerous variations on calls to mb_convert_encoding() to see if you can find a combination of 'from' and 'to' that fixes your data. In the end, it's often best to not even bother worrying about fixing the old data because of the pain levels involved, but instead to just fix things going forward.

However, before doing this, you need to make sure that you fix everything that is causing this issue in the first place. You've already mentioned that your DB table collation and editors are set properly. But there are more places where you need to check to make sure that everything is properly UTF-8:

Make sure that you are serving your HTML as UTF-8:
- header("Content-Type: text/html; charset=utf-8");
Change your PHP default charset to UTF-8:
- ini_set("default_charset", 'utf-8');
If your database doesn't ALWAYS talk in UTF-8, then you may need to tell it on a per-connection basis to ensure it's in UTF-8 mode. In MySQL you do that by issuing:
- charset utf8
You may need to tell your web server to always try to talk in UTF-8. In Apache the directive is:
- AddDefaultCharset UTF-8
Finally, you need to ALWAYS make sure that you are using PHP functions that are properly UTF-8 compliant. This means always using the mb_* styled 'multibyte aware' string functions. It also means that when calling functions such as htmlspecialchars(), you include the appropriate 'utf-8' charset parameter at the end to make sure that it doesn't encode them incorrectly.

If you slip up on any one step in the whole process, the encoding can be mangled and problems arise. Once you get in the 'groove' of doing UTF-8 though, this all becomes second nature. And of course, PHP6 is supposed to be fully Unicode compliant from the get-go, which will make lots of this easier (hopefully).

Question 20

Thank you very much! Because there are also many correctly encoded strings in the DB, which makes the problem worse, I chose to str_replace the strings I know are corrupt with their correct characters. It works great. I have already implemented most of your tips regarding PHP and server setup, but it is a great summary, so I would choose this as the answer, because my solution is not really beautiful.

Question 21

One important note on this advice: do NOT pass 'utf-8' as the charset argument to htmlspecialchars(). Without the argument, that function does the correct thing with UTF-8 strings, since it ignores all bytes with the high bit set and passes them through unchanged, which preserves them. With 'utf-8', htmlspecialchars() interprets the UTF-8 string but doesn't handle characters outside the BMP (those with code points U+10000 and above, encoded in four bytes). It incorrectly encodes those that happen to match the specials mod 65536. The behavior is both slower and wrong.

Question 22

Please see my answer below. I addressed all these problems in a single pure-PHP function: fixUTF8(). You don't need to change your server configuration, and you don't even need to have the multibyte functions installed. The function is smart enough to fix any character independently, even if the encoding is mixed inside the same string (no matter how many times it was converted or whether it's in UTF-8 already).

Question 23

PHP 6 was skipped; PHP 7 will be a stable release in one month.

Question 24

@Jayrox: There is a better answer with a tool from GitHub: stackoverflow.com/a/3521340/196210

Question 25

I had a problem with an XML file that had a broken encoding: it claimed to be UTF-8 but contained characters that were not UTF-8.
After several rounds of trial and error with mb_convert_encoding(), I managed to fix it with

mb_convert_encoding($text, 'Windows-1252', 'UTF-8')
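To see why Windows-1252 rather than plain Latin-1 can matter here, consider a garbled smart quote. The sketch below is Python, not PHP; the function name is hypothetical, and it assumes the text was garbled by reading UTF-8 bytes as Windows-1252. Converting the text to Windows-1252 yields the original bytes, which are then read as UTF-8.

```python
def reinterpret_cp1252_as_utf8(text: str) -> str:
    """Turn text back into Windows-1252 bytes and read those bytes as UTF-8."""
    return text.encode("cp1252").decode("utf-8")


garbled_quote = "\u00e2\u20ac\u0153"  # displays as â€œ
print(reinterpret_cp1252_as_utf8(garbled_quote))  # the left double quotation mark

# The same string cannot be reversed through Latin-1, because the euro sign
# and the oe ligature do not exist in that character set.
try:
    garbled_quote.encode("latin-1")
    latin1_works = True
except UnicodeEncodeError:
    latin1_works = False
print(latin1_works)  # False
```

For characters such as ü, where the garbled form "Ã¼" only uses code points shared by both character sets, either encoding gives the same result.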

Question 26

This worked for me after days of banging my head over the issue (everything was UTF-8 end to end but in RSS it wasn't!) Thank you!

Question 27

My problem was: Database fields saved as latin1_swedish_ci, output by PHP as utf-8 showing Umlaute ü as Ã¼ and ö as Ã¶. This helped to fix this.

Question 28

As Dan pointed out: you need to convert them to binary and then convert/correct the encoding.

E.g., for utf8 stored as latin1 the following SQL will fix it:

UPDATE table
 SET field = CONVERT( CAST(field AS BINARY) USING utf8)
 WHERE $broken_field_condition

Question 29

interesting; i'll remember this if i have the issue again. thanks

Question 30

Makes sense. I guess it's really double-encoded, it's just that the field is marked latin1 even though it really contains UTF8, so when you request the field as UTF8 it encodes it again.

Question 31

Man, you made my day, it worked for me. Now I'd like to understand the real reason why the dump I'm working with has these wrong characters (maybe it was correctly encoded in utf-8 but the dump process printed the output as latin1)

Question 32

WHERE LENGTH( field ) != CHAR_LENGTH( field ) ;)
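In MySQL, LENGTH() counts bytes and CHAR_LENGTH() counts characters, so this condition selects rows that contain at least one multi-byte character. A small Python sketch of the same comparison (the function name is hypothetical):

```python
def has_multibyte(text: str) -> bool:
    """Mirror LENGTH(field) != CHAR_LENGTH(field) for a UTF-8 column value."""
    byte_length = len(text.encode("utf-8"))  # what LENGTH() reports
    char_length = len(text)                  # what CHAR_LENGTH() reports
    return byte_length != char_length


print(has_multibyte("plain ascii"))   # False
print(has_multibyte("\u00c3\u00ae"))  # garbled "Ã®" -> True
print(has_multibyte("î"))             # correct "î" -> True as well
```

As the last line shows, the filter narrows the update to non-ASCII rows but does not by itself tell broken rows from correct ones.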

Question 33

$bad_string = "Luis PÃ©rez Casas, del Collettivo di avvocati â€œJosÃ© Alvear Restrepoâ€, Colombia, unâ€™organizzazione soggetta a costanti minacce";
$good_string = fix_broken_chars($bad_string);
echo $good_string;
function fix_broken_chars($garbled_utf8_string)
{
    // Garbled sequence => intended character, grouped from longest key to shortest.
    $conv_table = unserialize('a:5:{i:0;a:3:{s:8:"â€™";s:3:"’";s:8:"â€“";s:3:"–";s:8:"â€”";s:3:"—";}i:1;a:12:{s:7:"â‚¬";s:3:"€";s:7:"â€š";s:3:"‚";s:7:"â€ž";s:3:"„";s:7:"â€¦";s:3:"…";s:7:"â€¡";s:3:"‡";s:7:"â€°";s:3:"‰";s:7:"â€¹";s:3:"‹";s:7:"â€˜";s:3:"‘";s:7:"â€œ";s:3:"“";s:7:"â€¢";s:3:"•";s:7:"â„¢";s:3:"™";s:7:"â€º";s:3:"›";}i:2;a:22:{s:5:"Ã€";s:2:"À";s:5:"Ã‚";s:2:"Â";s:5:"Æ’";s:2:"ƒ";s:5:"Ã„";s:2:"Ä";s:5:"Ã…";s:2:"Å";s:5:"â€";s:3:"”";s:5:"Ã†";s:2:"Æ";s:5:"Ã‡";s:2:"Ç";s:5:"Ë†";s:2:"ˆ";s:5:"Ã‰";s:2:"É";s:5:"Ã‹";s:2:"Ë";s:5:"Å’";s:2:"Œ";s:5:"Ã‘";s:2:"Ñ";s:5:"Ã’";s:2:"Ò";s:5:"Ã“";s:2:"Ó";s:5:"Ã”";s:2:"Ô";s:5:"Ã•";s:2:"Õ";s:5:"Ã–";s:2:"Ö";s:5:"Ã—";s:2:"×";s:5:"Ã™";s:2:"Ù";s:5:"Ã›";s:2:"Û";s:5:"Å“";s:2:"œ";}i:3;a:77:{s:4:"Ãƒ";s:2:"Ã";s:4:"Ãˆ";s:2:"È";s:4:"ÃŠ";s:2:"Ê";s:4:"ÃŒ";s:2:"Ì";s:4:"Å½";s:2:"Ž";s:4:"ÃŽ";s:2:"Î";s:4:"Ëœ";s:2:"˜";s:4:"Ã˜";s:2:"Ø";s:4:"Å¡";s:2:"š";s:4:"Ãš";s:2:"Ú";s:4:"Ãœ";s:2:"Ü";s:4:"Å¾";s:2:"ž";s:4:"Ãž";s:2:"Þ";s:4:"Å¸";s:2:"Ÿ";s:4:"ÃŸ";s:2:"ß";s:4:"Â¡";s:2:"¡";s:4:"Ã¡";s:2:"á";s:4:"Â¢";s:2:"¢";s:4:"Ã¢";s:2:"â";s:4:"Â£";s:2:"£";s:4:"Ã£";s:2:"ã";s:4:"Â¤";s:2:"¤";s:4:"Ã¤";s:2:"ä";s:4:"Â¥";s:2:"¥";s:4:"Ã¥";s:2:"å";s:4:"Â¦";s:2:"¦";s:4:"Ã¦";s:2:"æ";s:4:"Â§";s:2:"§";s:4:"Ã§";s:2:"ç";s:4:"Â¨";s:2:"¨";s:4:"Ã¨";s:2:"è";s:4:"Â©";s:2:"©";s:4:"Ã©";s:2:"é";s:4:"Âª";s:2:"ª";s:4:"Ãª";s:2:"ê";s:4:"Â«";s:2:"«";s:4:"Ã«";s:2:"ë";s:4:"Â¬";s:2:"¬";s:4:"Ã¬";s:2:"ì";s:4:"Â­";s:2:"­";s:4:"Ã­";s:2:"í";s:4:"Â®";s:2:"®";s:4:"Ã®";s:2:"î";s:4:"Â¯";s:2:"¯";s:4:"Ã¯";s:2:"ï";s:4:"Â°";s:2:"°";s:4:"Ã°";s:2:"ð";s:4:"Â±";s:2:"±";s:4:"Ã±";s:2:"ñ";s:4:"Â²";s:2:"²";s:4:"Ã²";s:2:"ò";s:4:"Â³";s:2:"³";s:4:"Ã³";s:2:"ó";s:4:"Â´";s:2:"´";s:4:"Ã´";s:2:"ô";s:4:"Âµ";s:2:"µ";s:4:"Ãµ";s:2:"õ";s:4:"Â¶";s:2:"¶";s:4:"Ã¶";s:2:"ö";s:4:"Â·";s:2:"·";s:4:"Ã·";s:2:"÷";s:4:"Â¸";s:2:"¸";s:4:"Ã¸";s:2:"ø";s:4:"Â¹";s:2:"¹";s:4:"Ã¹";s:2:"ù";s:4:"Âº";s:2:"º";s:4:"Ãº";s:2:"ú";s:4:"Â»";s:2:"»";s:4:"Ã»";s:2:"û";s:4:"Â¼";s:2:"¼";s:4:"Ã¼";s:2:"ü";s:4:"Â½";s:2:"½";s:4:"Ã½";s:2:"ý";s:4:"Â¾";s:2:"¾";s:4:"Ã¾";s:2:"þ";s:4:"Â¿";s:2:"¿";s:4:"Ã¿";s:2:"ÿ";}i:4;a:1:{s:2:"Ã";s:2:"à";}}');
    // Apply each group in turn so that longer garbled sequences are replaced first.
    foreach ($conv_table as $convert) {
        $garbled_utf8_string = str_replace(array_keys($convert), $convert, $garbled_utf8_string);
    }
    return $garbled_utf8_string;
}

Implements this table http://www.i18nqa.com/debug/utf8-debug.html
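A table of this kind does not have to be typed by hand; it can be generated. Below is a minimal Python sketch, not a port of the PHP function above. All names are hypothetical, and it assumes the garbling came from reading UTF-8 bytes as Windows-1252. Characters whose UTF-8 bytes include a value that Python's cp1252 codec leaves undefined (Á, Í and the right double quotation mark, for example) are skipped, so the generated table is smaller than the hand-written one.

```python
import re


def build_table() -> dict:
    """Map each garbled sequence to the Windows-1252 character it stands for."""
    table = {}
    for byte in range(0x80, 0x100):
        try:
            char = bytes([byte]).decode("cp1252")
            garbled = char.encode("utf-8").decode("cp1252")
        except UnicodeDecodeError:
            continue  # byte is undefined in cp1252, skip this character
        table[garbled] = char
    return table


TABLE = build_table()
# Longest keys first, so a three-character sequence wins over any shorter match.
PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(TABLE, key=len, reverse=True))
)


def fix_broken_chars(text: str) -> str:
    """Replace every known garbled sequence in a single left-to-right pass."""
    return PATTERN.sub(lambda match: TABLE[match.group(0)], text)


print(fix_broken_chars("Luis P\u00c3\u00a9rez Casas"))
```

Doing the replacement in one regular-expression pass avoids a pitfall of chained str_replace calls, where the output of an earlier replacement can be matched again by a later one.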

Question 34

Doesn't work for some characters, but works well enough. Thanks!

Question 35

I know this isn't very elegant, but after it was mentioned that the strings may be double encoded, I made this function:

function fix_double_encoding($string)
{
    // Accented characters to look for (not an exhaustive list).
    $utf8_chars = explode(' ', 'À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö');
    $utf8_double_encoded = array();
    foreach ($utf8_chars as $utf8_char) {
        // Apply utf8_encode() twice to build the garbled form to search for.
        $utf8_double_encoded[] = utf8_encode(utf8_encode($utf8_char));
    }
    // Swap each garbled form for the character it stands for.
    $string = str_replace($utf8_double_encoded, $utf8_chars, $string);
    return $string;
}

This seems to work perfectly to remove the double encoding I am experiencing. I am probably missing some of the characters that could be an issue to others. However, for my needs it is working perfectly.

Question 36

Take a look at my answer. The function Encoding::fixUTF8(). It fixes all UTF8 characters (there are millions of them), and can handle strings encoded multiple times, not only twice.

Question 37

The way to do it is to convert to binary and then to the correct encoding.

Question 38

What? That doesn't even begin to make sense!

Question 39

Another thing to check, which happened to be my solution (found here), is how data is being returned from your server. In my application, I'm using PDO to connect from PHP to MySQL. I needed to add a flag to the connection that tells it to return the data in UTF-8 format.

The answer was

$dbHandle = new PDO("mysql:host=$dbHost;dbname=$dbName;charset=utf8", $dbUser, $dbPass, 
 array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES 'utf8'"));

Question 40

In my case, I found out by using mb_convert_encoding() that the previous encoding was iso-8859-1 (which is latin1), then I fixed my problem with an SQL query:

UPDATE myDB.myTable SET myColumn = CAST(CAST(CONVERT(myColumn USING latin1) AS binary) AS CHAR)

However, it is stated in the mysql documentations that conversion may be lossy if the column contains characters that are not in both character sets.

Question 41

It looks like your utf-8 is being interpreted as iso8859-1 or Win-1250 at some point.

When you say "In my database I have a few instances of bad encodings" - how did you check this? Through your app, phpmyadmin or the command line client? Are all utf-8 encodings showing up like this or only some? Is it possible you had the encodings wrong and it has been incorrectly converted from iso8859-1 to utf-8 when it was utf-8 already?

Question 42

I use phpmyadmin for database management. And no, not all cases are badly encoded.

Question 43

I had the same problem a long time ago, and I fixed it using

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-15">

Question 44

I found a solution after days of search. My comment is going to be buried but anyway...

- I read the corrupted data with PHP.
- I don't use SET NAMES utf8.
- I call utf8_decode() on the data.
- I update the database with the decoded data, still without SET NAMES utf8.

and voilà :)

Question 45

This script had a nice approach. Converting it to the language of your choice should not be too difficult:

http://plasmasturm.org/log/416/

#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( decode FB_QUIET );
binmode STDIN, ':bytes';
binmode STDOUT, ':encoding(UTF-8)';
my $out;
while ( <> ) {
 $out = '';
 while ( length ) {
 # consume input string up to the first UTF-8 decode error
 $out .= decode( "utf-8", $_, FB_QUIET );
 # consume one character; all octets are valid Latin-1
 $out .= decode( "iso-8859-1", substr( $_, 0, 1 ), FB_QUIET ) if length;
 }
 print $out;
}
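For readers who would rather not use Perl, the same idea can be sketched in Python: decode as much valid UTF-8 as possible, treat the first offending byte as Latin-1, and continue. The function name is hypothetical, and this is an illustrative port, not the original author's code.

```python
def decode_mixed(data: bytes) -> str:
    """Decode bytes that mix valid UTF-8 sequences with stray Latin-1 bytes."""
    pieces = []
    pos = 0
    while pos < len(data):
        try:
            # Everything from pos onwards is valid UTF-8: take it all and stop.
            pieces.append(data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start is the offset of the first bad byte within the slice.
            bad = pos + err.start
            pieces.append(data[pos:bad].decode("utf-8"))
            # Consume exactly one byte; every byte is a valid Latin-1 character.
            pieces.append(data[bad:bad + 1].decode("latin-1"))
            pos = bad + 1
    return "".join(pieces)


print(decode_mixed(b"caf\xe9 na\xc3\xafve"))  # café naïve
```

Like the Perl script, this repairs input that mixes the two encodings; it does not undo double encoding, which is already valid UTF-8 and passes through unchanged.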

Question 46

I recently had to work on a legacy project which had the MySQL table collation set to latin1_swedish_ci and when I retrieved the data from PHP it was showing up as encoded garbage ¡à\à¤ ̈à\‡ à¤°à¤1à¤° à¤. The text was supposed to show up as utf8.

Specifying the charset after the db connection in PHP fixed it for me:

mysqli_set_charset($conn,"latin1");

I'd like to know how setting up charset to latin1 fixed this up and how to clean up the db properly from someone more knowledgeable about this.

Accepted answer: the mysqldump round trip under Question 7 above, answered by jsdalton on Dec 16, 2010 at 16:05 and edited by Aaron D on Apr 20, 2016 at 12:22.

Comments on the accepted answer (their text appears under Questions 8 to 12 above) were posted by Energiequant (2011-03-14), Frost (2011-11-11), Prine (2012-03-28), Yves Van Broekhoven (2012-11-12) and user828591 (2015-11-23).
