Python, UnicodeDecodeError

Question 1

I get this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 4: ordinal not in range(128)

I tried setting many different codecs (in the header, like # -*- coding: utf8 -*-), or even using u"string", but it still appears.

How do I fix this?

Edit: I don't know the actual character that's causing this, but since this is a program that recursively browses folders, it must have found a file with strange characters in its name

Code:

# -*- coding: utf8 -*-
# by TerabyteST
###########################
# Explores given path recursively
# and finds file which size is bigger than the set treshold
import sys
import os
class Explore():
 def __init__(self):
 self._filelist = []
 def exploreRec(self, folder, treshold):
 print folder
 generator = os.walk(folder + "/")
 try:
 content = generator.next()
 except:
 return
 folders = content[1]
 files = content[2]
 for n in folders:
 if "$" in n:
 folders.remove(n)
 for f in folders:
 self.exploreRec(u"%s/%s"%(folder, f), treshold)
 for f in files:
 try:
 rawsize = os.path.getsize(u"%s/%s"%(folder, f))
 except:
 print "Error reading file %s"%u"%s/%s"%(folder, f)
 continue
 mbsize = rawsize / (1024 * 1024.0)
 if mbsize >= treshold:
 print "File %s is %d MBs!"%(u"%s/%s"%(folder, f), mbsize)

Error:

Traceback (most recent call last):
 File "<pyshell#19>", line 1, in <module>
 a.exploreRec("C:", 100)
 File "D:/Python/Explorator/shitfinder.py", line 35, in exploreRec
 print "Error reading file %s"%u"%s/%s"%(folder, f)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 4: ordinal not in range(128)

Here is what is shown using print repr("Error reading file %s"%u"%s/%s"%(folder.decode('utf-8','ignore'), f.decode('utf-8','ignore')))

>>> a = Explore()
>>> a.exploreRec("C:", 100)
File C:/Program Files/Ableton/Live 8.0.4/Resources/DefaultPackages/Live8Library_v8.2.alp is 258 MBs!
File C:/Program Files/Adobe/Reader 9.0/Setup Files/{AC76BA86-7AD7-1040-7B44-A90000000001}/Data1.cab is 114 MBs!
File C:/Program Files/Microsoft Games/Age of Empires III/art/Art1.bar is 393 MBs!
File C:/Program Files/Microsoft Games/Age of Empires III/art/art2.bar is 396 MBs!
File C:/Program Files/Microsoft Games/Age of Empires III/art/art3.bar is 228 MBs!
File C:/Program Files/Microsoft Games/Age of Empires III/Sound/Sound.bar is 273 MBs!
File C:/ProgramData/Microsoft/Search/Data/Applications/Windows/Windows.edb is 162 MBs!
REPR:
u"Error reading file C:/ProgramData/Microsoft/Windows/GameExplorer/{1B4801C1-CA86-487E-8347-B26F1CCB2F75}/SupportTasks/0/Sito web di Mirror's Edge.lnk"
END REPR:
Error reading file C:/ProgramData/Microsoft/Windows/GameExplorer/{1B4801C1-CA86-487E-8347-B26F1CCB2F75}/SupportTasks/0/Sito web di Mirror's Edge.lnk
REPR:
u"Error reading file C:/ProgramData/Microsoft/Windows/GameExplorer/{1B4801C1-CA86-487E-8347-B26F1CCB2F75}/SupportTasks/1/Contenuti scaricabili di Mirror's Edge.lnk"
END REPR:
Error reading file C:/ProgramData/Microsoft/Windows/GameExplorer/{1B4801C1-CA86-487E-8347-B26F1CCB2F75}/SupportTasks/1/Contenuti scaricabili di Mirror's Edge.lnk
REPR:
u'Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Google Talk/Supporto/Modalitiagnostica di Google Talk.lnk'
END REPR:
Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Google Talk/Supporto/Modalitiagnostica di Google Talk.lnk
REPR:
u'Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Microsoft SQL Server 2008/Strumenti di configurazione/Segnalazione errori e utilizzo funzionaliti SQL Server.lnk'
END REPR:
Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Microsoft SQL Server 2008/Strumenti di configurazione/Segnalazione errori e utilizzo funzionaliti SQL Server.lnk
REPR:
u'Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Mozilla Firefox/Mozilla Firefox ( Modalitrovvisoria).lnk'
END REPR:
Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Mozilla Firefox/Mozilla Firefox ( Modalitrovvisoria).lnk
REPR:
u'Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Mozilla Firefox 3.6 Beta 1/Mozilla Firefox 3.6 Beta 1 ( Modalitrovvisoria).lnk'
END REPR:
Error reading file C:/ProgramData/Microsoft/Windows/Start Menu/Programs/Mozilla Firefox 3.6 Beta 1/Mozilla Firefox 3.6 Beta 1 ( Modalitrovvisoria).lnk
Traceback (most recent call last):
 File "<pyshell#21>", line 1, in <module>
 a.exploreRec("C:", 100)
 File "D:/Python/Explorator/shitfinder.py", line 30, in exploreRec
 self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)
 File "D:/Python/Explorator/shitfinder.py", line 30, in exploreRec
 self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)
 File "D:/Python/Explorator/shitfinder.py", line 30, in exploreRec
 self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)
 File "D:/Python/Explorator/shitfinder.py", line 30, in exploreRec
 self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)
 File "D:/Python/Explorator/shitfinder.py", line 30, in exploreRec
 self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)
 File "D:/Python/Explorator/shitfinder.py", line 30, in exploreRec
 self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x99 in position 78: ordinal not in range(128)
>>>

Question 2

A full traceback would help too.

Question 3

@John: Not completely true. The "coding: utf-8" comment in the beginning determines, which source file encoding python expects.

Question 4

@SpeckiniusFlecksis Yes, in the year since posting that comment, I have learned otherwise. :)

Question 5

VTC: Unclear thus impossible to answer.

Question 6

We can't guess what you are trying to do, nor what's in your code, not what "setting many different codecs" means, nor what u"string" is supposed to do for you.

Please change your code to its initial state so that it reflects as best you can what you are trying to do, run it again, and then edit your question to provide (1) the full traceback and error message that you get (2) snippet encompassing the last statement in your script that appears in the traceback (3) a brief description of what you want the code to do (4) what version of Python you are running.

Edit after details added to question:

(0) Let's try some transformations on the failing statement:

Original:
print "Error reading file %s"%u"%s/%s"%(folder, f)
Add spaces for reduced illegibility:
print "Error reading file %s" % u"%s/%s" % (folder, f)
Add parentheses to emphasise evaluation order:
print ("Error reading file %s" % u"%s/%s") % (folder, f)
Evaluate the (constant) expression in parentheses:
print u"Error reading file %s/%s" % (folder, f)

Is that really what you intended? Suggestion: construct the path ONCE, using a better method (see point (2) below).

(1) In general, use repr(foo) or "%r" % foo for diagnostics. That way, your diagnostic code is much less likely to cause an exception (as is happening here) AND you avoid ambiguity. Insert the statement print repr(folder), repr(f) before you try to get the size, rerun, and report back.

(2) Don't make paths by u"%s/%s" % (folder, filename) ... use os.path.join(folder, filename)

(3) Don't have bare excepts, check for known problems. So that unknown problems don't remain unknown, do something like this:

try:
 some_code()
except ReasonForBaleOutError:
 continue
except: 
 # something's gone wrong, so get diagnostic info
 print repr(interesting_datum_1), repr(interesting_datum_2)
 # ... and get traceback and error message
 raise

A more sophisticated way would involve logging instead of printing, but the above is much better than not knowing what's going on.

Further edits after rtm("os.walk"), remembering old legends, and re-reading your code:

(4) os.walk() walks over the whole tree; you don't need to call it recursively.

(5) If you pass a unicode string to os.walk(), the results (paths, filenames) are reported as unicode. You don't need all that u"blah" stuff. Then you just have to choose how you display the unicode results.

(6) Removing paths with "$" in them: You must modify the list in situ but your method is dangerous. Try something like this:

for i in xrange(len(folders), -1, -1):
 if '$' in folders[i]:
 del folders[i]

(7) Your refer to files by joining a folder name and a file name. You are using the ORIGINAL folder name; when you rip out the recursion, this won't work; you'll need to use the currently-discarded content[0] value reported by os.walk.

(8) You should find yourself using something very simple like:

for folder, subfolders, filenames in os.walk(unicoded_top_folder):

There's no need for generator = os.walk(...); try: content = generator.next() etc and BTW if you ever need to do generator.next() in the future, use except StopIteration instead of a bare except.

(9) If the caller provides a non-existent folder, no exception is raised, it just does nothing. If the provided folder is exists but is empty, ditto. If you need to distinguish between these two scenarios, you'll need to do extra testing yourself.

Response to this comment from the OP: """Thanks, please read the info repr() has shown in the first post. I don't know why it printed so many different items, but it looks like they all have problems. And the common thing between all of them is they are .ink files. May that be the problem? Also, in the last ones, the firefox ones, it prints ( Modalitrovvisoria) while the real file name from Explorer contains ( Modalità provvisoria)"""

(10) Umm that's not ".INK".lower(), it's ".LNK".lower() ... perhaps you need to change the font in whatever you're reading that with.

(11) The fact that the "problem" file names all end in ".lnk" /may/ be something to do with os.walk() and/or Windows doing something special with the names of those files.

(12) I repeat here the Python statement that you used to produce that output, with some whitespace introduced :

print repr(
 "Error reading file %s" \
 % u"%s/%s" % (
 folder.decode('utf-8','ignore'),
 f.decode('utf-8','ignore')
 )
 )

It seems that you have not read, or not understood, or just ignored, the advice I gave you in a comment on another answer (and that answerer's reply): UTF-8 is NOT relevant in the context of file names in a Windows file system.

We are interested in exactly what folder and f refer to. You have trampled all over the evidence by attempting to decode it using UTF-8. You have compounded the obfuscation by using the "ignore" option. Had you used the "replace" option, you would have seen "( Modalit\ufffdrovvisoria)". The "ignore" option has no place in debugging.

In any case, the fact that some of the file names had some kind of error but appeared NOT to lose characters with the "ignore" option (or appeared NOT to be mangled) is suspicious.

Which part of """Insert the statement print repr(folder), repr(f) """ did you not understand? All that you need to do is something like this:

print "Some meaningful text" # "error reading file" isn't
print "folder:", repr(folder)
print "f:", repr(f)

(13) It also appears that you have introduced UTF-8 elsewhere in your code, judging by the traceback: self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)

I would like to point out that you still do not know whether folder and f refer to str objects or unicode objects, and two answers have suggested that they are very likely to be str objects, so why introduce blahbah.encode() ??

A more general point: Try to understand what your problem(s) is/are, BEFORE changing your script. Thrashing about trying every suggestion coupled with near-zero effective debugging technique is not the way forward.

(14) When you run your script again, you might like to reduce the volume of the output by running it over some subset of C:\ ... especially if you proceed with my original suggestion to have debug printing of ALL file names, not just the erroneous ones (knowing what non-error ones look like could help in understanding the problem).

Response to Bryan McLemore's "clean up" function:

(15) Here is an annotated interactive session that illustrates what actually happens with os.walk() and non-ASCII file names:

C:\junk\terabytest>dir
[snip]
 Directory of C:\junk\terabytest
20/11/2009 01:28 PM <DIR> .
20/11/2009 01:28 PM <DIR> ..
20/11/2009 11:48 AM <DIR> empty
20/11/2009 01:26 PM 11 Hašek.txt
20/11/2009 01:31 PM 1,419 tbyte1.py
29/12/2007 09:33 AM 9 Ð.txt
 3 File(s) 1,439 bytes
[snip]
C:\junk\terabytest>\python26\python
Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] onwin32
Type "help", "copyright", "credits" or "license" for more information.
>>> from pprint import pprint as pp
>>> import os

os.walk(unicode_string) -> results in unicode objects

>>> pp(list(os.walk(ur"c:\junk\terabytest")))
[(u'c:\\junk\\terabytest',
 [u'empty'],
 [u'Ha\u0161ek.txt', u'tbyte1.py', u'\xd0.txt']),
 (u'c:\\junk\\terabytest\\empty', [], [])]

os.walk(str_string) -> results in str objects

>>> pp(list(os.walk(r"c:\junk\terabytest")))
[('c:\\junk\\terabytest',
 ['empty'],
 ['Ha\x9aek.txt', 'tbyte1.py', '\xd0.txt']),
 ('c:\\junk\\terabytest\\empty', [], [])]

cp1252 is the encoding I'd expect to be used on my system ...

>>> u'\u0161'.encode('cp1252')
'\x9a'
>>> 'Ha\x9aek'.decode('cp1252')
u'Ha\u0161ek'

decoding the str with UTF-8 doesn't work, as expected

>>> 'Ha\x9aek'.decode('utf8')
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "C:\python26\lib\encodings\utf_8.py", line 16, in decode
 return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 2: unexpected code byte

ANY random string of bytes can be decoded without error using latin1

>>> 'Ha\x9aek'.decode('latin1')
u'Ha\x9aek'

BUT U+009A is a control character (SINGLE CHARACTER INTRODUCER), i.e. meaningless gibberish; absolutely nothing to do with the correct answer

>>> unicodedata.name(u'\u0161')
'LATIN SMALL LETTER S WITH CARON'
>>>

(16) That example shows what happens when the character is representable in the default character set; what happens if it's not? Here's an example (using IDLE this time) of a file name containing CJK ideographs, which definitely aren't representable in my default character set:

IDLE 2.6.4 
>>> import os
>>> from pprint import pprint as pp

repr(Unicode results) looks fine

>>> pp(list(os.walk(ur"c:\junk\terabytest\chinese")))
[(u'c:\\junk\\terabytest\\chinese', [], [u'nihao\u4f60\u597d.txt'])]

and the unicode displays just fine in IDLE:

>>> print list(os.walk(ur"c:\junk\terabytest\chinese"))[0][2][0]
nihao你好.txt

The str result is evidently produced by using .encode(whatever, "replace") -- not very useful e.g. you can't open the file by passing that as the file name.

>>> pp(list(os.walk(r"c:\junk\terabytest\chinese")))
[('c:\\junk\\terabytest\\chinese', [], ['nihao??.txt'])]

So the conclusion is that for best results, one should pass a unicode string to os.walk(), and deal with any display problems.

Question 7

the problem is that the exception happens in the except, most probably because the encode inside the try raises the same exception too, and I don't know how to get any further information about it (like what character is causing this or where is the mentioned file/folder situated on my hard drive)

Question 8

@terabytest: "don't know how to get any further information"?? Have you read my edited response especially parts (1) and (3) where it mentions the repr() function?

Question 9

thanks, please read the info repr() has shown in the first post. I don't know why it printed so many different items, but it looks like they all have problems. And the common thing between all of them is they are .ink files. May that be the problem? Also, in the last ones, the firefox ones, it prints ( Modalitrovvisoria) while the real file name from Explorer contains ( Modalità provvisoria)

Question 10

@terabytest: response far too long for a comment; see update to answer.

Question 11

Python uses ASCII encoding by default, which is annoying. If you want to change it permanently, find and edit site.py file, search for def setencoding() and few lines below change encoding = "ascii" to encoding = "utf-8". Bye, bye default ASCII encoding.

Question 12

You are trying to perform some action (e.g., print) on unicode string that contains non-ASCII characters, and the string is being converted to ascii by default. You will need specify the encoding to correctly represent the string.
It would help significantly, if you post some sample code of what you are trying to do.

The simplest way to do this would be:
s = u'ma\xf1ana';
print s.encode('latin-1');

Edited after details added to question:

In your case you need to decode the string you read first:
f.decode();,
so try changing
u"%s/%s" % (folder, f)
to
os.path.join(folder, f.decode())

Note, that 'latin-1' encoding might be needed to changed to what your file is named with

PS: John Machin has mentioned very helpful ways to improve and clean up the code. +1

Question 13

yes, i'm trying to print the string. How do I tell it not to use ascii?

Question 14

The simplest example would be as follows: s = u'ma\xf1ana'; <br/> print s.encode('latin-1')

Question 15

can't get comment to display code sample properly, so I put it in the answer.

Question 16

@terabytest: What is "the" string? Both folder and f are suspect AFAICT.

Question 17

@artdanil: (1) The OP is on Windows. Windows doesn't use latin1. Telling him to encode/decode using latin1 can't remove the fog; it can only thicken it. (2) Why are you telling him to decode? How do you know f is unicode??

Question 18

Are you running this program in a Windows cmd.exe box? If so, try running it in IDLE and see if you get the same errors. The Cmd.exe box doesn't do unicode, only ascii.

Question 19

Some unicode items:

putting # encoding: utf-8 at the top of the file sometimes help (if your editor uses UTF-8 to save your file ...)
s = "i'm a string"
u = u"i'm unicode, at least in python < ۳"
If your work with files try to look into the codecs module.

Further readings:

Question 20

u"%s" % f

In various places you're doing something similar to the above code. This is exactly the wrong way to convert a str object to a unicode object as the conversion is done using sys.getdefaultencoding() (ascii), which is almost guaranteed to be wrong.

You should be using the encode/decode methods to convert to/from a unicode object. This requires knowing what encoding your input (the strings returned from os.walk) is. E.g., if the filenames are encoded in UTF-8

uf = f.decode('utf-8')

will interpret f as a UTF-8 encoded sequence of bytes and give back the proper unicode object. Similarly, when you need to output the unicode object you would then convert it back to a str, specifying the valid encoding you want to output it as.

print uf.encode('utf-8')

Question 21

I still get the same UnicodeDecodeError: 'ascii' codec can't decode byte 0x99 in position 125: ordinal not in range(128) even though i wrote print "Error reading file %s"%("%s/%s"%(folder, f)).decode("utf-8")... Wth

Question 22

The topic is about Windows file paths, so utf8 is about as relevant as latin1.

Question 23

My post was before he added that information and was intended to be generic information explaining problems with the approach being used. Understanding those problems helps, regardless of what the actual encoding being used is. Unfortunately, my post was taken at face value instead of as a general description.

Question 24

I've had the misfortune of working in some codebases that weren't consistent with their encoding.

This is a function we used to help clean it up:

def to_unicode(value):
 if isinstance(value, unicode):
 return value
 elif isinstance(value, str):
 try:
 if value.startswith('\xff\xfe'):
 return value.decode('utf-16-le')
 elif value.startswith('\xfe\xff'):
 return value.decode('utf-16-be')
 else:
 return value.decode('utf-8')
 except UnicodeDecodeError:
 return value.decode('latin-1')
 else:
 try:
 return unicode(value)
 except UnicodeError:
 return to_unicode(str(value))
 except TypeError:
 if hasattr(value, '__unicode__'):
 return value.__unicode__()

So using that function you can use:

print u"Error reading file %s/%s" % (to_unicode(folder), to_unicode(f))

Question 25

What makes you imagine that names of directories and files as returned by os.walk() on Windows will not be encoded consistently? These names will not have UTF-16 BOMs, so your code will try UTF8, which will fail on any non-ASCII byte, and in that case return the string decoded using latin1, IOW utter rubbish with no indication that there's a problem! See my updated answer for an example.

Question 26

If you read more than the code block you'd see that this was not a tailor made function to handle this question. It is a general to_unicode function that I'd hoped would help him. Your comment is far from constructive and instead very inflammatory, and highly uncalled for.

Question 27

I saw it was a general to_unicode function. As I explained, if the str object is not encoded in latin1 or UTF-8 and doesn't start with a UTF-16 BOM, your routine decode it using latin1, which is NOT helpful -- because LATIN1 WILL DECODE ANY SEQUENCE OF BYTES WITHOUT ERROR, your routine is concealing the problem. See my example where your procedure takes an str encoded in cp1252 and returns unicode containing a control character. Such results are meaningless rubbish and it needs to be said they're meaningless rubbish -- for the OP as well as in your own application(s).

Question 28

instead of doing:

print "Error reading file %s"%u"%s/%s"%(folder, f)

Try this:

print "Error reading file %s"%u"%s/%s"%(folder.encode('ascii','ignore'), f.encode('ascii','ignore'))

Since the console can't print unicode chars, you can's see the correct name. 'ignore' tells the codec to skip those characters. you can also use 'replace' (prints a '?'), 'xmlcharrefreplace' (replaces with &x#### of the code point), 'backslashreplace' (replaces with \x###### of the code)

You will need to encode every unicode string like this that you print.

Question 29

I'm still getting this UnicodeDecodeError: 'ascii' codec can't decode byte 0x99 in position 25: ordinal not in range(128) it's a mistery

Question 30

Using 'ignore' worked for me. It's just for displaying some output - NBD if the output isn't perfect.

John Machin 83.3k12 gold badges147 silver badges193 bronze badges · Accepted Answer · 2009-11-19 21:47:46Z

We can't guess what you are trying to do, nor what's in your code, not what "setting many different codecs" means, nor what u"string" is supposed to do for you.

Please change your code to its initial state so that it reflects as best you can what you are trying to do, run it again, and then edit your question to provide (1) the full traceback and error message that you get (2) snippet encompassing the last statement in your script that appears in the traceback (3) a brief description of what you want the code to do (4) what version of Python you are running.

Edit after details added to question:

(0) Let's try some transformations on the failing statement:

Original:
print "Error reading file %s"%u"%s/%s"%(folder, f)
Add spaces for reduced illegibility:
print "Error reading file %s" % u"%s/%s" % (folder, f)
Add parentheses to emphasise evaluation order:
print ("Error reading file %s" % u"%s/%s") % (folder, f)
Evaluate the (constant) expression in parentheses:
print u"Error reading file %s/%s" % (folder, f)

Is that really what you intended? Suggestion: construct the path ONCE, using a better method (see point (2) below).

(1) In general, use repr(foo) or "%r" % foo for diagnostics. That way, your diagnostic code is much less likely to cause an exception (as is happening here) AND you avoid ambiguity. Insert the statement print repr(folder), repr(f) before you try to get the size, rerun, and report back.

(2) Don't make paths by u"%s/%s" % (folder, filename) ... use os.path.join(folder, filename)

(3) Don't have bare excepts, check for known problems. So that unknown problems don't remain unknown, do something like this:

try:
 some_code()
except ReasonForBaleOutError:
 continue
except: 
 # something's gone wrong, so get diagnostic info
 print repr(interesting_datum_1), repr(interesting_datum_2)
 # ... and get traceback and error message
 raise

A more sophisticated way would involve logging instead of printing, but the above is much better than not knowing what's going on.

Further edits after rtm("os.walk"), remembering old legends, and re-reading your code:

(4) os.walk() walks over the whole tree; you don't need to call it recursively.

(5) If you pass a unicode string to os.walk(), the results (paths, filenames) are reported as unicode. You don't need all that u"blah" stuff. Then you just have to choose how you display the unicode results.

(6) Removing paths with "$" in them: You must modify the list in situ but your method is dangerous. Try something like this:

for i in xrange(len(folders), -1, -1):
 if '$' in folders[i]:
 del folders[i]

(7) Your refer to files by joining a folder name and a file name. You are using the ORIGINAL folder name; when you rip out the recursion, this won't work; you'll need to use the currently-discarded content[0] value reported by os.walk.

(8) You should find yourself using something very simple like:

for folder, subfolders, filenames in os.walk(unicoded_top_folder):

There's no need for generator = os.walk(...); try: content = generator.next() etc and BTW if you ever need to do generator.next() in the future, use except StopIteration instead of a bare except.

(9) If the caller provides a non-existent folder, no exception is raised, it just does nothing. If the provided folder is exists but is empty, ditto. If you need to distinguish between these two scenarios, you'll need to do extra testing yourself.

Response to this comment from the OP: """Thanks, please read the info repr() has shown in the first post. I don't know why it printed so many different items, but it looks like they all have problems. And the common thing between all of them is they are .ink files. May that be the problem? Also, in the last ones, the firefox ones, it prints ( Modalitrovvisoria) while the real file name from Explorer contains ( Modalità provvisoria)"""

(10) Umm that's not ".INK".lower(), it's ".LNK".lower() ... perhaps you need to change the font in whatever you're reading that with.

(11) The fact that the "problem" file names all end in ".lnk" /may/ be something to do with os.walk() and/or Windows doing something special with the names of those files.

(12) I repeat here the Python statement that you used to produce that output, with some whitespace introduced :

print repr(
 "Error reading file %s" \
 % u"%s/%s" % (
 folder.decode('utf-8','ignore'),
 f.decode('utf-8','ignore')
 )
 )

It seems that you have not read, or not understood, or just ignored, the advice I gave you in a comment on another answer (and that answerer's reply): UTF-8 is NOT relevant in the context of file names in a Windows file system.

We are interested in exactly what folder and f refer to. You have trampled all over the evidence by attempting to decode it using UTF-8. You have compounded the obfuscation by using the "ignore" option. Had you used the "replace" option, you would have seen "( Modalit\ufffdrovvisoria)". The "ignore" option has no place in debugging.

In any case, the fact that some of the file names had some kind of error but appeared NOT to lose characters with the "ignore" option (or appeared NOT to be mangled) is suspicious.

Which part of """Insert the statement print repr(folder), repr(f) """ did you not understand? All that you need to do is something like this:

print "Some meaningful text" # "error reading file" isn't
print "folder:", repr(folder)
print "f:", repr(f)

(13) It also appears that you have introduced UTF-8 elsewhere in your code, judging by the traceback: self.exploreRec(("%s/%s"%(folder, f)).encode("utf-8"), treshold)

I would like to point out that you still do not know whether folder and f refer to str objects or unicode objects, and two answers have suggested that they are very likely to be str objects, so why introduce blahbah.encode() ??

A more general point: Try to understand what your problem(s) is/are, BEFORE changing your script. Thrashing about trying every suggestion coupled with near-zero effective debugging technique is not the way forward.

(14) When you run your script again, you might like to reduce the volume of the output by running it over some subset of C:\ ... especially if you proceed with my original suggestion to have debug printing of ALL file names, not just the erroneous ones (knowing what non-error ones look like could help in understanding the problem).

Response to Bryan McLemore's "clean up" function:

(15) Here is an annotated interactive session that illustrates what actually happens with os.walk() and non-ASCII file names:

C:\junk\terabytest>dir
[snip]
 Directory of C:\junk\terabytest
20/11/2009 01:28 PM <DIR> .
20/11/2009 01:28 PM <DIR> ..
20/11/2009 11:48 AM <DIR> empty
20/11/2009 01:26 PM 11 Hašek.txt
20/11/2009 01:31 PM 1,419 tbyte1.py
29/12/2007 09:33 AM 9 Ð.txt
 3 File(s) 1,439 bytes
[snip]
C:\junk\terabytest>\python26\python
Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] onwin32
Type "help", "copyright", "credits" or "license" for more information.
>>> from pprint import pprint as pp
>>> import os

os.walk(unicode_string) -> results in unicode objects

>>> pp(list(os.walk(ur"c:\junk\terabytest")))
[(u'c:\\junk\\terabytest',
 [u'empty'],
 [u'Ha\u0161ek.txt', u'tbyte1.py', u'\xd0.txt']),
 (u'c:\\junk\\terabytest\\empty', [], [])]

os.walk(str_string) -> results in str objects

>>> pp(list(os.walk(r"c:\junk\terabytest")))
[('c:\\junk\\terabytest',
 ['empty'],
 ['Ha\x9aek.txt', 'tbyte1.py', '\xd0.txt']),
 ('c:\\junk\\terabytest\\empty', [], [])]

cp1252 is the encoding I'd expect to be used on my system ...

>>> u'\u0161'.encode('cp1252')
'\x9a'
>>> 'Ha\x9aek'.decode('cp1252')
u'Ha\u0161ek'

decoding the str with UTF-8 doesn't work, as expected

>>> 'Ha\x9aek'.decode('utf8')
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "C:\python26\lib\encodings\utf_8.py", line 16, in decode
 return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 2: unexpected code byte

ANY random string of bytes can be decoded without error using latin1

>>> 'Ha\x9aek'.decode('latin1')
u'Ha\x9aek'

BUT U+009A is a control character (SINGLE CHARACTER INTRODUCER), i.e. meaningless gibberish; absolutely nothing to do with the correct answer

>>> unicodedata.name(u'\u0161')
'LATIN SMALL LETTER S WITH CARON'
>>>

(16) That example shows what happens when the character is representable in the default character set; what happens if it's not? Here's an example (using IDLE this time) of a file name containing CJK ideographs, which definitely aren't representable in my default character set:

IDLE 2.6.4 
>>> import os
>>> from pprint import pprint as pp

repr(Unicode results) looks fine

>>> pp(list(os.walk(ur"c:\junk\terabytest\chinese")))
[(u'c:\\junk\\terabytest\\chinese', [], [u'nihao\u4f60\u597d.txt'])]

and the unicode displays just fine in IDLE:

>>> print list(os.walk(ur"c:\junk\terabytest\chinese"))[0][2][0]
nihao你好.txt

The str result is evidently produced by using .encode(whatever, "replace") -- not very useful e.g. you can't open the file by passing that as the file name.

>>> pp(list(os.walk(r"c:\junk\terabytest\chinese")))
[('c:\\junk\\terabytest\\chinese', [], ['nihao??.txt'])]

So the conclusion is that for best results, one should pass a unicode string to os.walk(), and deal with any display problems.

the problem is that the exception happens in the except, most probably because the encode inside the try raises the same exception too, and I don't know how to get any further information about it (like what character is causing this or where is the mentioned file/folder situated on my hard drive)
@terabytest: "don't know how to get any further information"?? Have you read my edited response especially parts (1) and (3) where it mentions the repr() function?
thanks, please read the info repr() has shown in the first post. I don't know why it printed so many different items, but it looks like they all have problems. And the common thing between all of them is they are .ink files. May that be the problem? Also, in the last ones, the firefox ones, it prints ( Modalitrovvisoria) while the real file name from Explorer contains ( Modalità provvisoria)
@terabytest: response far too long for a comment; see update to answer.

CollectivesTM on Stack Overflow

Python, UnicodeDecodeError

8 Answers 8

4 Comments

Comments

7 Comments

Comments

Comments

3 Comments

3 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

CollectivesTM on Stack Overflow

8 Answers 8

4 Comments

Comments

7 Comments

Comments

Comments

3 Comments

3 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related