Re: code page
- Subject: Re: code page
- From: David Given <dg@...>
- Date: 12 May 2009 23:13:28 +0100
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Marco Antonio Abreu wrote:
> When a field
> value has one accented char, the last char is truncated ('Flávia' comes
> out like 'Fl??vi', where ?? are special chars); if the text has two accented
> chars, the last two chars are cut off, and so on...
This is a classic symptom of UTF-8 misparsing.
What happens is: somebody is encoding the string as UTF-8 as follows:
46 6c c3 a1 76 69 61
Note that the 'á' is encoded as two bytes (c3 a1). Someone further along
is then parsing those bytes as if they were ISO-8859-1 (a.k.a. Latin-1),
which comes out as:
FlÃ¡via
Those two bytes are now interpreted as two distinct code points (c3 is
'Ã' and a1 is '¡' in ISO-8859-1). The string is therefore seven code
points long instead of six, so whatever still expects six characters
discards the last one (the 'a').
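The effect can be reproduced in a few lines. This is only a sketch in Python, not the code from the thread; in particular, the truncation step is an assumption about what the pipeline does, modelled here as cutting the misparsed text back to the original character count.

```python
original = "Flávia"

# Stage 1: the producer encodes the string as UTF-8.
utf8_bytes = original.encode("utf-8")
hex_dump = utf8_bytes.hex(" ")  # '46 6c c3 a1 76 69 61'

# Stage 2: the consumer wrongly decodes those bytes as ISO-8859-1.
# Every byte becomes one code point, so c3 a1 turns into two characters.
misparsed = utf8_bytes.decode("iso-8859-1")  # 7 code points

# Stage 3 (assumed): something still expects the original 6 characters
# and drops whatever does not fit.
truncated = misparsed[:len(original)]  # loses the trailing 'a'
```

With two accented characters the misparsed text is two code points too long, so two trailing characters are lost, which matches the reported behaviour.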
You should probably check each stage of your pipeline to make sure that
it's receiving and accepting the right encoding --- it sounds like
something's getting it wrong.
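One way to probe a stage is to take the text it hands on and test whether it looks like UTF-8 that was read as ISO-8859-1. The helper below is a hypothetical sketch in Python (the name looks_like_misparsed_utf8 is made up for illustration); it is a heuristic, since a genuine Latin-1 string can occasionally form valid UTF-8 by coincidence.

```python
def looks_like_misparsed_utf8(text: str) -> bool:
    """Return True if text is probably UTF-8 that was decoded as ISO-8859-1."""
    try:
        # Undo the suspected wrong decode: one code point back to one byte.
        raw = text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return False  # contains characters outside Latin-1
    if raw.isascii():
        return False  # pure ASCII reads the same in both encodings
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False  # not valid UTF-8, so probably genuine Latin-1 text
    return True


def repair(text: str) -> str:
    """Reverse the wrong decode; only call this when the check above passes."""
    return text.encode("iso-8859-1").decode("utf-8")
```

Running the check on the output of each stage in turn should show where the text first goes wrong. Repairing it afterwards only works if nothing has been truncated yet, so fixing the stage that misreads the encoding is the better cure.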
- --
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────
│
│ "People who think they know everything really annoy those of us who
│ know we don't." --- Bjarne Stroustrup
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFKCfSFf9E0noFvlzgRAit9AKCqXOUrbpWR5qweRLmQhfRXmnhlVgCfSmMn
r1shX4gBRP6YIeZ4HwIupnk=
=0UxC
-----END PGP SIGNATURE-----