UTF-8 and os.listdir()

Question 1

I'm having a bit of trouble with a file whose name contains the "ș" character (that's \xC8\x99 in UTF-8, LATIN SMALL LETTER S WITH COMMA BELOW).

I'm creating a ș.txt file and trying to get it back with os.listdir(). Unfortunately, os.listdir() returns it as s\xCC\xA6 ("s" + COMBINING COMMA BELOW), so my test program (below) fails.

This happens on my OS X machine, but it works on a Linux machine. Any idea what exactly causes this behavior? Both environments are configured with LANG=en_US.UTF8.

Here's the test program:

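# coding: utf-8
import os

fname = "ș.txt"  # byte string holding the composed form (\xC8\x99)
with open(fname, "w") as f:
    f.write("hi")

files = os.listdir(".")
print "fname: ", fname
print "files: ", files
if fname in files:
    print "found"
else:
    print "not found"  # printed on OS X, where listdir() returns s\xCC\xA6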

Martijn Pieters · Accepted Answer · 2014-11-04 10:40:08Z


The OS X filesystem (HFS+) stores filenames mostly in decomposed form rather than composed form, which is why the "ș" you wrote comes back as "s" + COMBINING COMMA BELOW. You'll need to normalise the filenames back to the composed NFC form:

import os
import unicodedata
# A unicode path (u'.') makes os.listdir() return unicode filenames.
files = [unicodedata.normalize('NFC', f) for f in os.listdir(u'.')]

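Passing a unicode path makes os.listdir() return the filenames as unicode objects; with a bytestring path you would first need to decode each name to unicode before normalising. The name you compare against must be unicode as well (u"ș.txt").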

Also see the unicodedata.normalize() function documentation.
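
To make the mismatch concrete, here is a minimal sketch. It assumes Python 3, where string literals are already Unicode and no u'' prefix is needed, unlike the Python 2 code above. The helper names listdir_nfc and has_file are made up for illustration; only os.listdir() and unicodedata.normalize() are real library calls.

```python
import os
import unicodedata

composed = "\u0219"       # LATIN SMALL LETTER S WITH COMMA BELOW
decomposed = "s\u0326"    # "s" + COMBINING COMMA BELOW

# The two spellings render identically but are different code point sequences.
assert composed != decomposed
assert composed.encode("utf-8") == b"\xc8\x99"
assert decomposed.encode("utf-8") == b"s\xcc\xa6"

# NFC composes, NFD decomposes; after normalising, the strings compare equal.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed


def listdir_nfc(path="."):
    """Hypothetical helper: list a directory with every name in NFC form."""
    return [unicodedata.normalize("NFC", name) for name in os.listdir(path)]


def has_file(name, path="."):
    """Hypothetical helper: membership test that ignores normalisation form."""
    return unicodedata.normalize("NFC", name) in listdir_nfc(path)
```

Normalising both the name you look for and the names coming back from the filesystem means the check should behave the same whether the filesystem hands back composed or decomposed names.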

edited Nov 4, 2014 at 11:35


8 Comments

Unknown · 2014-11-04 11:03:36Z

Thanks for the link, I understand what's going on now. Your code is not working btw, I need to do u"ș.txt" in [unicodedata.normalize('NFC', f) for f in os.listdir(u'.')] instead.

Martijn Pieters · 2014-11-04 11:04:58Z

@Unknown: right, or decode and then encode again. But using a unicode path is better.

Nam Pham · 2016-05-19 11:18:52Z

@Unknown: how did you do that? I'm facing that problem too.

Martijn Pieters · 2016-05-19 11:20:58Z

@NamPham: do what exactly, what problem are you facing? The files list will contain a list of Unicode string objects, each normalised.

Nam Pham · 2016-05-19 15:34:37Z

I'm struggling with the decoding and encoding process. I can't pass u'.' as the argument to listdir. My path is unicode :(
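
For the decode-then-normalise route mentioned in the comments, here is a minimal sketch. It assumes Python 3, where passing a bytes path to os.listdir() returns bytes names and os.fsdecode() converts them to str using the filesystem encoding. The helper name normalised_names is hypothetical.

```python
import os
import unicodedata


def normalised_names(path):
    """Hypothetical helper: return NFC names for a str or bytes directory path."""
    names = []
    for entry in os.listdir(path):
        # A bytes path yields bytes entries; decode them with the
        # filesystem encoding before normalising.
        if isinstance(entry, bytes):
            entry = os.fsdecode(entry)
        names.append(unicodedata.normalize("NFC", entry))
    return names
```

In Python 2 the corresponding step would be calling .decode('utf-8') on each byte string, assuming the filenames are UTF-8 encoded, before passing it to unicodedata.normalize().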
