Python3 utf-8 decode issue

Question 1

The following code runs fine with Python3 on my Windows machine and prints the character 'é':

data = b"\xc3\xa9"
print(data.decode('utf-8'))

However, running the same code in an Ubuntu-based Docker container results in:

UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)

Is there anything I have to install to enable UTF-8 decoding?

Question 2

Decoding the given bytes as 'utf-8' should work regardless. I get the error you quote only when I explicitly specify 'ascii' as the codec, and your error message also hints that ASCII is being used. I don't know of any Linux that has used anything other than UTF-8 as its default for many years.

Question 3

@planetmaker: There are definitely some "minimal" setups for Linux that default to LANG=C, where the print, not the decode, would have the problem. Explicitly changing to LANG=en_US.utf-8 in the relevant shell initialization file (and logging out and then back in to make sure the locale is set properly everywhere) should fix it.

Question 4

@ShadowRanger It doesn't, on Ubuntu Xenial at least. I have had the locale lv_LV.Utf-8 from the very start, but Python defaults to ASCII. I only found out recently when I tried to enter Unicode in the CLI. In files I had always specified the encoding via a comment.
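
To see which encodings a given interpreter has actually picked up, a small diagnostic along these lines may help. It is only a sketch: the helper name describe_encodings is made up for illustration, and the values it reports depend entirely on the machine and shell it runs in.

```python
import locale
import os
import sys


def describe_encodings():
    """Collect the encoding-related settings Python sees at runtime."""
    return {
        # Codec used when print() writes to standard output
        "stdout": sys.stdout.encoding,
        # Codec Python derives from the locale (used by open() without encoding=)
        "preferred": locale.getpreferredencoding(False),
        # Codec used for file names
        "filesystem": sys.getfilesystemencoding(),
        # Environment variables that commonly influence the values above
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
    }


for name, value in describe_encodings().items():
    print(f"{name}: {value}")
```

Running this both in the shell where things work and inside the container should show where the two environments differ.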

Question 5

It seems that Ubuntu, depending on the version, uses one encoding or another as its default, and the default may differ between the shell and Python as well. Adapted from this posting and also this blog:

Thus the recommended way seems to be to tell your Python instance to use UTF-8 as its default encoding:

Set the encoding Python uses for standard input, output and error (not for source files) via an environment variable:

export PYTHONIOENCODING=utf8
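
The effect of this variable can be checked without touching the shell configuration by starting a child interpreter with a modified environment. The following is a minimal sketch; the helper name run_child is hypothetical, and it assumes the child is started with the same interpreter via sys.executable.

```python
import os
import subprocess
import sys


def run_child(extra_env):
    """Run a child interpreter that prints 'é'; return (exit code, raw stdout bytes)."""
    env = dict(os.environ, **extra_env)
    result = subprocess.run(
        # The child executes: print('\xe9'), i.e. it prints the character é
        [sys.executable, "-c", "print('\\xe9')"],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return result.returncode, result.stdout.strip()


# With UTF-8 forced for the standard streams, the child emits the two UTF-8 bytes of 'é'.
utf8_code, utf8_output = run_child({"PYTHONIOENCODING": "utf8"})
print(utf8_code, utf8_output)

# With ASCII forced, the child is expected to fail with the error from the question.
ascii_code, ascii_output = run_child({"PYTHONIOENCODING": "ascii"})
print(ascii_code, ascii_output)
```

The same comparison can be made from the shell by prefixing the python3 command with the variable assignment.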

Also, in your source files you can explicitly state the encoding in which the file itself is written, so that it is read correctly irrespective of the environment settings (see this question + answer, the Python docs and PEP 263):

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
....

Concerning the encoding of files read by Python, you can specify it explicitly in the open() call:

with open(fname, "rt", encoding="utf-8") as f:
 ...
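
As a small round-trip sketch of the explicit-encoding approach (the temporary directory and the file name sample.txt are placeholders chosen for illustration):

```python
import os
import tempfile

text = "\u00e9"  # the character 'é', written as an escape to keep this file pure ASCII

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")

    # Write with an explicit codec instead of relying on the locale default
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Read the raw bytes to see what actually landed on disk
    with open(path, "rb") as f:
        raw_bytes = f.read()

    # Read back as text, again naming the codec explicitly
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()

print(raw_bytes == b"\xc3\xa9", round_trip == text)
```

Because the codec is named on every open() call, the result does not depend on LANG, LC_ALL or any other environment setting.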

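and there is a more hackish way with some side effects that saves you from specifying the encoding explicitly each time. Note that it applies to Python 2 only: Python 3 has no sys.setdefaultencoding(), and reload() is no longer a builtin there.

import sys
# Python 2 only: sys.setdefaultencoding() is removed from sys at startup
reload(sys)  # reloading the module brings it back
sys.setdefaultencoding('UTF8')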

Please read the warnings about this hack in the related answer and comments.

Question 6

I don't know why this was downvoted; this is exactly the case I had recently (Ubuntu shell using UTF-8, but the Python command-line interpreter using ASCII). Just note that in Python code (files) you don't need to use sys; just specify the encoding via a comment at the start of the file.

Question 7

Thanks, @Gnudiff - I added that way to the answer

Question 8

Please note that the # coding: utf8 line in the source header and sys.setdefaultencoding do different things. The first is about how the Python interpreter decodes the source file, including its string literals. The second changes the default codec for implicit conversions between byte strings and Unicode strings in Python 2; it does not exist in Python 3, where open() without an explicit codec falls back to the locale's preferred encoding. You should also be aware that the setdefaultencoding trick is a hack and can have bad side effects (I forgot what specifically, because I never use it). It's better to always use open(fn, encoding=...).

Question 9

@lenz of course... should have added that, too. And amended as well. Thanks!

Question 10

The problem is with the print() call, not with the decode() method. If you look closely, the raised exception is a UnicodeEncodeError, not a UnicodeDecodeError.
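
The two steps can be separated to show which one fails. This sketch stands in for what print() does on an ASCII terminal by calling encode("ascii") directly, which is an assumption about the container's stream settings rather than a trace of the real run:

```python
data = b"\xc3\xa9"

# Step 1: decoding the UTF-8 bytes succeeds in every environment
text = data.decode("utf-8")

# Step 2: encoding the resulting str for an ASCII-only output is what fails
error_name = None
error_codec = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__
    error_codec = exc.encoding

print(error_name, error_codec)
```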

Whenever you use the print() function, Python converts its arguments to a str and subsequently encodes the result to bytes, which are sent to the terminal (or whatever Python is run in). The codec which is used for encoding (eg. UTF-8 or ASCII) depends on the environment. In an ideal case,

the codec which Python uses is compatible with the one which the terminal expects, so the characters are displayed correctly (otherwise you get mojibake like "Ã©" instead of "é");
the codec used covers a range of characters that is sufficient for your needs (such as UTF-8 or UTF-16, which can encode every Unicode character).

In your case, the second condition isn't met in the Linux Docker container you mention: the encoding used is ASCII, which only supports characters found on an old English typewriter. These are a few options to address this problem:

Set environment variables: on Linux, Python's encoding defaults depend on these (at least partially). In my experience, this takes a bit of trial and error; setting LC_ALL to something containing "UTF-8" worked for me once. You'll have to put them in the start-up script of the shell your terminal runs, e.g. .bashrc.
Re-open STDOUT with an explicit encoding, like so:
```
import sys

sys.stdout = open(sys.stdout.buffer.fileno(), 'w', encoding='utf8')
```
The encoding used has to match that of the terminal.
Encode the strings yourself and send them to the binary buffer underlying sys.stdout, e.g. sys.stdout.buffer.write("é".encode('utf8')). This is of course much more boilerplate than print("é"). Again, the encoding used has to match that of the terminal.
Avoid print() altogether. Use open(fn, encoding=...) for output and the logging module for progress info. Depending on how interactive your script is, this might be worthwhile (admittedly, you'll probably face the same encoding problem when the logging module writes to STDERR).
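
The encoding step behind these options can be tried out without a real terminal by wrapping an in-memory byte buffer in a text stream. This is a sketch only: encode_like_print is a hypothetical helper, and io.BytesIO stands in for the terminal.

```python
import io


def encode_like_print(text, encoding, errors="strict"):
    """Return the bytes a text stream with the given codec writes for print(text)."""
    raw = io.BytesIO()
    # newline="\n" keeps the line ending identical on every platform
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors, newline="\n")
    print(text, file=stream)
    stream.flush()
    return raw.getvalue()


# A UTF-8 stream turns 'é' into its two-byte encoding
utf8_bytes = encode_like_print("\u00e9", "utf-8")

# An ASCII stream can only cope if told how to handle unencodable characters
escaped_bytes = encode_like_print("\u00e9", "ascii", errors="backslashreplace")

# With the default strict error handling, the ASCII stream raises instead
try:
    encode_like_print("\u00e9", "ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(utf8_bytes, escaped_bytes, strict_failed)
```

On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") changes the codec of the existing stream in place, which may be a tidier alternative to re-opening the file descriptor.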

There might be other options, but I doubt that there are nicer ones.

planetmaker · Accepted Answer · 2017-12-25 14:24:57Z

