I'm having trouble with a file whose name contains the "ș" character (LATIN SMALL LETTER S WITH COMMA BELOW, \xC8\x99 in UTF-8).
I create a file named ș.txt and try to find it again with os.listdir(). Unfortunately, os.listdir() returns the name as s\xCC\xA6 ("s" + COMBINING COMMA BELOW), so my test program (below) fails.
This happens on my OS X machine, but the same program works on a Linux machine. Any idea what exactly causes this behavior? Both environments are configured with LANG=en_US.UTF8.
Here's the test program:
# coding: utf-8
import os

fname = "ș.txt"
with open(fname, "w") as f:
    f.write("hi")

files = os.listdir(".")
print "fname: ", fname
print "files: ", files
if fname in files:
    print "found"
else:
    print "not found"
1 Answer
The OS X filesystem mostly uses decomposed characters rather than their combined form. You'll need to normalise the filenames back to the NFC combined normalised form:
import os
import unicodedata
files = [unicodedata.normalize('NFC', f) for f in os.listdir(u'.')]
Passing a unicode path (u'.') makes os.listdir() return the filenames as unicode objects; with a byte-string path you would need to decode each filename to unicode before normalising it.
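To see why the membership test in the question fails, here is a small sketch comparing the two spellings of "ș". It is written for Python 3 (an assumption; the question uses Python 2), and the variable names are only illustrative.

```python
import unicodedata

composed = "\u0219"      # "ș" as one code point: LATIN SMALL LETTER S WITH COMMA BELOW
decomposed = "s\u0326"   # "s" followed by COMBINING COMMA BELOW

# The two spellings look the same on screen but are different strings.
assert composed != decomposed
assert composed.encode("utf-8") == b"\xc8\x99"
assert decomposed.encode("utf-8") == b"s\xcc\xa6"

# Normalising both sides to the same form makes them comparable.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```

Either form works for the comparison, as long as both the name you search for and the names from os.listdir() are normalised the same way.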
Also see the unicodedata.normalize() function documentation.
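If you do this lookup in more than one place, it may be worth wrapping it in a helper. The following is a rough Python 3 sketch, not a definitive implementation; listdir_nfc and has_file are hypothetical names made up for this example.

```python
import os
import unicodedata


def listdir_nfc(path="."):
    """Hypothetical helper: list a directory with every name normalised to NFC."""
    names = []
    for name in os.listdir(path):
        # os.listdir() returns bytes when given a bytes path; decode those first.
        if isinstance(name, bytes):
            name = os.fsdecode(name)
        names.append(unicodedata.normalize("NFC", name))
    return names


def has_file(fname, path="."):
    """Hypothetical helper: check for fname regardless of composed/decomposed form."""
    return unicodedata.normalize("NFC", fname) in listdir_nfc(path)
```

Normalising the name being searched for as well means the check no longer depends on which form the caller or the filesystem happens to use.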
8 Comments
u"ș.txt" in [unicodedata.normalize('NFC', f) for f in os.listdir(u'.')] instead.
files list will contain a list of Unicode string objects, each normalised.
u'.' as an argument for listdir. My path is unicode :(