String encoding in Py2.7
Fabien LUCE
fabienluce at gmail.com
Tue May 29 05:19:52 EDT 2018
May 29 2018 11:12 AM, "Thomas Jollans" <tjol at tjol.eu> wrote:
> On 2018-05-29 09:55, ftg at lutix.org wrote:
>> Hello,
>> Using Python 2.7 (I will switch to Py3 soon, but first I'd like to understand how string encoding
>> worked)
> Oh dear. This is probably the exact wrong way to go about it: the
> interplay between string encoding, unicode and bytes is much less clear
> and easy to understand in Python 2.
Ok I will quickly jump into py3 then.
>> Could you please tell me if I understood correctly what goes on in Python's mind:
>> in a .py file:
>> if I write s="héhéhé" and my file is declared with a unicode coding, Python will store in memory
>> s='hx82hx82hx82'
> No, it doesn't. At the very least, you're missing some backslashes – and
> I don't know of any character encoding that uses 0x82 to encode é.
Surprisingly, the backslashes were removed from my initial text...
OK, so the stored raw bytes are the ones produced by the system encoding. If my console were UTF-8, I would get the same raw byte string as you.
> On my system, I see
> >>> s = 'héhéhé'
> >>> s
> 'h\xc3\xa9h\xc3\xa9h\xc3\xa9'
> My system uses UTF-8. If your PC is set up to use an encoding like ISO
> 8859-15 or Windows-1252, you should see
> 'h\xe9h\xe9h\xe9'
> The \x?? are just Python notation.
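As a hedged illustration of the point above, the sketch below encodes the same text with a few codecs and shows the resulting bytes. It is written so that it should run on both Python 2.7 and Python 3; the names `text` and `encoded` are made up for this example. Note that the legacy DOS code pages cp437 and cp850, which are often the default in a Windows console, do encode é as 0x82.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# u'\xe9' is the code point of é, so this is the text 'héhéhé'.
text = u'h\xe9h\xe9h\xe9'

# Illustrative selection of codecs; all of them ship with CPython.
encoded = {}
for codec in ('utf-8', 'iso-8859-15', 'cp1252', 'cp437', 'cp850'):
    encoded[codec] = text.encode(codec)
    # repr() shows non-ASCII bytes in \xNN notation.
    print(codec, repr(encoded[codec]))
```

The UTF-8 result uses two bytes per é, while the single-byte encodings use one byte each, with a value that depends on the code page.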
>> However, this is not yet unicode for the Python interpreter; this is just raw bytes. Right?
> Right, this is a bunch of bytes:
> >>> s
> 'h\xe9h\xe9h\xe9'
> >>> [ord(c) for c in s]
> [104, 233, 104, 233, 104, 233]
> >>> [hex(ord(c)) for c in s]
> ['0x68', '0xe9', '0x68', '0xe9', '0x68', '0xe9']
>> By the way, why is 'h' not turned into a hex value? Because it is already in the ASCII table?
> That's just how Python 2 likes to display stuff.
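A small sketch of that display convention, assuming nothing beyond the standard library: printable ASCII bytes are shown as characters and everything else as \xNN, but the underlying bytes are the same either way. The names `raw` and `spelled_out` are only for illustration, and the code should run on Python 2.7 and Python 3.

```python
from __future__ import print_function

# The same six bytes, written two different ways in the source.
raw = b'h\xe9h\xe9h\xe9'
spelled_out = b'\x68\xe9\x68\xe9\x68\xe9'

# Both literals denote identical byte strings; only the notation differs.
print(raw == spelled_out)

# bytearray yields integers on both Python 2 and 3, so every byte,
# including 'h' (0x68), can be shown as a hex value if desired.
hex_values = [hex(b) for b in bytearray(raw)]
print(hex_values)
```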
>> If I want the Python interpreter to recognize my string as unicode, I have to declare it as unicode
>> s=u'héhéhé' and magically Python will look for those
>> hex values 'x82' in the Unicode table. Still OK?
> In principle, the unicode table has nothing to do with anything here. It
> so happens that for some characters in some encodings the value is equal
> to the code point, but that's neither here nor there.
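To illustrate the difference between a code point and its encoded bytes, here is a minimal sketch (it should run on Python 2.7 and Python 3; all variable names are invented for this example). For é, Latin-1 happens to use a byte equal to the code point, whereas for the euro sign no encoding shown here does.

```python
from __future__ import print_function

e_acute = u'\xe9'      # é, code point U+00E9
euro = u'\u20ac'       # euro sign, code point U+20AC

# Code points are plain integers, independent of any encoding.
e_code_point = ord(e_acute)
euro_code_point = ord(euro)
print(hex(e_code_point), hex(euro_code_point))

# Encoded bytes depend on the codec that is chosen.
e_latin1 = e_acute.encode('latin-1')   # one byte, equal to the code point
e_utf8 = e_acute.encode('utf-8')       # two bytes
euro_cp1252 = euro.encode('cp1252')    # one byte, unrelated to 0x20ac
euro_utf8 = euro.encode('utf-8')       # three bytes
print(repr(e_latin1), repr(e_utf8), repr(euro_cp1252), repr(euro_utf8))

# Going from bytes to text requires naming the encoding explicitly.
decoded = b'h\xe9h\xe9h\xe9'.decode('latin-1')
print(decoded == u'h\xe9h\xe9h\xe9')
```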
>> Now: how come when I declare s='héhéhé', print(s) correctly displays 'héhéhé'? Is it because my shell
>> window deals well with unicode? Or is it
>> because the print function is magic?
> It's because the print statement is magic.
> Actually, this *only* works if the encoding of your file matches the
> default encoding required by your console. This is usually the case as
> long as you stay on the same PC, but this assumption can fall apart
> quite easily when you move code and data between systems, especially if
> they use different operating systems or (human) languages.
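The failure mode described above can be simulated without touching a real console. The sketch below is an assumption-laden model, not what the print statement does internally: it treats "the file" as UTF-8 bytes and "the console" as something that decodes those bytes with its own encoding. The variable names are hypothetical, and the code should run on Python 2.7 and Python 3.

```python
from __future__ import print_function

text = u'h\xe9h\xe9h\xe9'            # 'héhéhé'
utf8_bytes = text.encode('utf-8')    # what a UTF-8 source file would contain

# A console that also expects UTF-8 recovers the original text.
matching = utf8_bytes.decode('utf-8')
print(matching == text)

# A console that expects Windows-1252 reads each byte as one character,
# so every é turns into the two characters U+00C3 and U+00A9.
garbled = utf8_bytes.decode('cp1252')
print(len(text), len(garbled))

# The opposite mismatch (Latin-1 bytes read as UTF-8) is not even valid.
latin1_bytes = text.encode('latin-1')
try:
    latin1_bytes.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)
```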
> Just use Python 3. There, the print function is not magic, which makes
> life so much more logical.
Thanks
> -- Thomas