51

This code:

import os

for root, dirs, files in os.walk('.'):
    print(root)

Gives me this error:

UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 27: surrogates not allowed

How do I walk through a file tree without getting toxic strings like this?

asked Dec 8, 2014 at 20:38
  • Python 3.4.0 (default, Apr 11 2014, 13:05:11) on Ubuntu 14.04. I have LANG=en_US.UTF-8 Commented Dec 8, 2014 at 20:53
  • Does print(root.encode("utf-8", "surrogateescape")) have any effect? Commented Dec 8, 2014 at 21:13
  • stackoverflow.com/questions/38147259/… has a somewhat more detailed explanation of what the error message means. Commented Feb 6, 2019 at 8:15

4 Answers

65

On Linux, filenames are 'just a bunch of bytes' and are not necessarily encoded in any particular encoding. Python 3 tries to turn everything into Unicode strings. To do so, the developers came up with a scheme (the surrogateescape error handler, PEP 383) that translates byte strings to Unicode strings and back without loss, and without knowing the original encoding. Each undecodable byte is mapped to a lone surrogate code point (U+DC80 to U+DCFF), but the normal UTF-8 encoder can't handle lone surrogates when printing to the terminal.

For example, here's a non-UTF8 byte string:

>>> b'C\xc3N'.decode('utf8','surrogateescape')
'C\udcc3N'

It can be converted to and from Unicode without loss:

>>> b'C\xc3N'.decode('utf8','surrogateescape').encode('utf8','surrogateescape')
b'C\xc3N'

But it can't be printed:

>>> print(b'C\xc3N'.decode('utf8','surrogateescape'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 1: surrogates not allowed
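
On POSIX systems, the standard library exposes the same round trip as os.fsdecode and os.fsencode. A minimal sketch, assuming a Linux-like platform where the filesystem error handler is surrogateescape (on Windows the handler differs, so these bytes may not decode there):

```python
import os

raw = b"C\xc3N"              # bytes that are not valid UTF-8
name = os.fsdecode(raw)      # str form, as os.walk('.') would hand it back
restored = os.fsencode(name) # back to bytes for the operating system

# On POSIX the original bytes are expected to survive the round trip.
print(restored == raw)
```

This is why the str names returned by os.walk('.') can still be passed to open() or os.stat() even when they cannot be printed.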

You'll have to figure out what you want to do with file names that aren't in the default encoding. One option is to encode them back to the original bytes and decode those with the 'replace' error handler, so unknown bytes become the replacement character. Use the result for display, but keep the original name to access the file.

>>> b'C\xc3N'.decode('utf8','replace')
'C�N'
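
The "display copy" idea can be sketched as a small helper. The name display_name and its encoding parameter are hypothetical, chosen only for illustration, and the sketch assumes the names came from os.walk() with a str argument:

```python
def display_name(name: str, encoding: str = "utf-8") -> str:
    """Return a printable copy of a name that os.walk() returned as str."""
    # Recover the original bytes: surrogateescape turns each lone surrogate
    # (U+DC80..U+DCFF) back into the byte it stands for.
    raw = name.encode(encoding, "surrogateescape")
    # Decode again, this time substituting U+FFFD for undecodable bytes.
    return raw.decode(encoding, "replace")


original = b"C\xc3N".decode("utf-8", "surrogateescape")  # 'C\udcc3N'
shown = display_name(original)

# ascii() keeps the demo independent of the terminal's encoding.
print(ascii(shown))
```

Keep `original` for file access and use `shown` only in messages and logs, since the replacement is lossy.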

os.walk can also take a byte string and will return byte strings instead of Unicode strings:

for p, d, f in os.walk(b'.'):
    print(p)  # p is bytes; d and f are lists of bytes

Then you can decode as you like.
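
One possible way to "decode as you like" is sketched below. The helper names decode_path and walk_printable are hypothetical, and the ISO-8859-1 fallback is an assumption about what the non-UTF-8 names contain, not something the file system can tell you:

```python
import os


def decode_path(raw: bytes) -> str:
    """Decode a bytes path as UTF-8, falling back to ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte maps to a code point in ISO-8859-1, so this cannot fail,
        # though the result may not be what the file's creator intended.
        return raw.decode("iso-8859-1")


def walk_printable(top: bytes = b"."):
    """Yield (raw_root, printable_root) pairs for every directory under top."""
    for root, dirs, files in os.walk(top):
        yield root, decode_path(root)


if __name__ == "__main__":
    for raw_root, shown in walk_printable(b"."):
        print(shown)
```

The raw bytes stay available for opening files, while the decoded string is only used for output.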

answered Dec 8, 2014 at 21:21

5 Comments

I ended up doing bad_string.encode('utf-8', 'surrogateescape').decode('ISO-8859-1')
@Collin Anderson How did you detect the occurrence of the bad string? How did you catch the error?
What worked for me was "bad string".encode('utf-8', 'surrogateescape').decode('utf-8')
You get upvote, and Python gets -10 points for Gryffindor.
@DoTheEvo Collin's hack works on both good and bad strings. It works because every byte is a valid code point in 'ISO-8859-1'. However, it will print weird things for characters whose UTF-8 and 'ISO-8859-1' encodings differ.
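
On the question of detecting a bad string: a minimal sketch of two checks, assuming the string came from a surrogateescape decode. The function names are hypothetical:

```python
def has_escaped_bytes(name: str) -> bool:
    """Return True if name contains lone surrogates left by surrogateescape."""
    # surrogateescape maps bytes 0x80..0xFF to U+DC80..U+DCFF.
    return any("\udc80" <= ch <= "\udcff" for ch in name)


def can_encode_as_utf8(name: str) -> bool:
    """Return True if the strict UTF-8 encoder accepts name."""
    try:
        name.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


bad = b"C\xc3N".decode("utf-8", "surrogateescape")
print(has_escaped_bytes(bad), can_encode_as_utf8(bad))
```

The first check looks only for the escape range, while the second also rejects any other unpaired surrogate, which is the condition that makes print() fail on a UTF-8 terminal.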
20

Try using this line of code:

"bad string".encode('utf-8', 'replace').decode()
Hari Krishnan
answered Apr 27, 2020 at 12:30
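
For comparison, here is a short sketch of what the one-liner above does to a surrogate-escaped name, next to two other built-in error handlers. The variable names are only for illustration:

```python
bad = b"C\xc3N".decode("utf-8", "surrogateescape")  # 'C\udcc3N'

# 'replace' substitutes a question mark when encoding.
replaced = bad.encode("utf-8", "replace").decode()           # 'C?N'

# 'backslashreplace' keeps the code point visible as an escape sequence,
# i.e. a literal backslash followed by udcc3.
escaped = bad.encode("utf-8", "backslashreplace").decode()   # 'C\\udcc3N'

# 'ignore' silently drops the offending character.
ignored = bad.encode("utf-8", "ignore").decode()             # 'CN'

print(replaced, escaped, ignored)
```

All three results are printable but lossy, so they suit display only; the original string is still needed to open the file.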


12

I ended up passing a byte string to os.walk(), which apparently returns byte strings instead of incorrect Unicode strings:

for root, dirs, files in os.walk(b'.'):
    print(root)
answered Dec 8, 2014 at 21:34


-4

Filter with sed or grep:

set | sed -n '/^[a-zA-Z0-9_]*=/p'
# ... or ...
set | grep '^[a-zA-Z0-9_]*='
# ... or ...
set | egrep '^[_[:alnum:]]+='

This is sensitive to how crazy your variable names are. The last version should handle most crazy things.

answered Apr 16, 2018 at 18:36

