[Python-ideas] TextIOWrapper callable encoding parameter

Mon Jun 11 17:06:18 CEST 2012

As a followup, here are some timing data that seem to confirm
a modest increase in speed as a result of implementing the
callable encoding parameter I proposed (although that would
not be the main reason for wanting to do it). These are just
for illustration. (Among many other reasons, _pyio benchmarks
are not very useful.)

I read four short test files using four methods for determining
each file's encoding. Each test file is a simplified model of a
Python source file: a coding declaration (always on the first
line in our case, with no BOM present [*1]) followed by mixed
English and Japanese text.

Method 0 (reopen0): 
Use the encoding callable I am proposing.
    def reopen0(fname):
        def hook(data, buf):
            return get_encoding(data)
        t = io.open(fname, encoding=hook)
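
For readers who want to experiment without patching io, the effect
of the hook can be approximated on an existing buffered stream with
peek(). This is only a sketch, not the proposed implementation: the
names sniff and wrap_with_hook are hypothetical, sniff is a trivial
stand-in for get_encoding, and it assumes `data` is the first chunk
of bytes and `buf` is the underlying buffered stream.

```python
import io

def sniff(data, buf):
    # Hypothetical stand-in for get_encoding(): look only at the first line.
    first_line = data.split(b"\n", 1)[0]
    return "shift_jis" if b"shift_jis" in first_line else None

def wrap_with_hook(buffered, hook, default="utf-8"):
    # peek() returns buffered bytes without advancing the stream position,
    # so no seek() is needed. It may return fewer or more bytes than asked.
    head = buffered.peek(256)
    enc = hook(head, buffered) or default
    return io.TextIOWrapper(buffered, encoding=enc)

payload = "# -*- coding: shift_jis -*-\nprint('日本語')\n".encode("shift_jis")
stream = io.BufferedReader(io.BytesIO(payload))
text = wrap_with_hook(stream, sniff).read()
print(text)
```

Because nothing is consumed and nothing is rewound, the same idea
should also work on a non-seekable stream, with the caveat that a
single peek() is not guaranteed to return a complete first line.
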
Method 1 (reopen1):
Open in binary to determine encoding, then rewrap in a 
TextIOWrapper with the correct encoding.
    def reopen1(fname):
        b = io.open(fname, 'rb')
        line = b.readline()
        enc = get_encoding(line)
        b.seek(0)
        t = io.TextIOWrapper(b, enc, line_buffering=True)
        t.mode = 'r'
Method 2 (reopen2):
Open in binary to determine encoding, then reopen in text mode
with correct encoding.
    def reopen2(fname):
        b = io.open(fname, 'rb')
        line = b.readline()
        enc = get_encoding(line)
        t = io.open(fname, encoding=enc)
Method 3 (reopen3):
Open in text mode (latin1) to determine encoding, then reopen
in text mode with correct encoding.
    def reopen3(fname):
        f = io.open(fname, encoding='latin1')
        line = f.readline()
        enc = get_encoding(line)
        t = io.open(fname, encoding=enc)

The same get_encoding() function is used in all methods [*1].
The input test data are all small files (because we want
to measure encoding detection, not how fast read() runs.)
Each has a Python/Emacs coding declaration in the first line.
  test.utf8  -- Tiny Python program with a coding declaration
     and a single print statement in its main() function that
     prints a short word (literal) in Japanese. Encoding is utf-8
     (122 bytes).
  test.sjis  -- Identical to test.utf8 but sjis encoding
     (111 bytes).
  test2.utf8 -- A Python coding declaration followed by
     approximately 50 long lines with mixed English and
     Japanese (4274 bytes).
  test2.sjis -- Identical to test2.utf8 but sjis encoding
     (3401 bytes).
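
bm.py itself is not included in this message. A harness that prints
lines in the same shape as the results could look roughly like the
following; the function name bench and the use of timeit are
assumptions, not a description of the actual script.

```python
import timeit

def bench(label, func, arg, reps=10000):
    # Time `reps` calls of func(arg) and format the total like the results.
    total = timeit.timeit(lambda: func(arg), number=reps)
    line = "%s / %s: total time (%d reps) was %f" % (
        label, func.__name__, reps, total)
    return total, line

# Example with a trivial callable standing in for reopen0..reopen3.
total, line = bench("demo", len, "abc", reps=100)
print(line)
```
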
Results:
---------------------------------------------------------
$ python3 bm.py test.utf8
test.utf8 / reopen0: total time (10000 reps) was 1.188323
test.utf8 / reopen1: total time (10000 reps) was 1.490757
test.utf8 / reopen2: total time (10000 reps) was 1.766081
test.utf8 / reopen3: total time (10000 reps) was 2.141996
$ python3 bm.py test.sjis
test.sjis / reopen0: total time (10000 reps) was 1.175914
test.sjis / reopen1: total time (10000 reps) was 1.471780
test.sjis / reopen2: total time (10000 reps) was 1.764444
test.sjis / reopen3: total time (10000 reps) was 2.122550
$ python3 bm.py test2.utf8
test2.utf8 / reopen0: total time (10000 reps) was 1.690255
test2.utf8 / reopen1: total time (10000 reps) was 1.996235
test2.utf8 / reopen2: total time (10000 reps) was 2.278798
test2.utf8 / reopen3: total time (10000 reps) was 2.727867
$ python3 bm.py test2.sjis
test2.sjis / reopen0: total time (10000 reps) was 1.841388
test2.sjis / reopen1: total time (10000 reps) was 2.147142
test2.sjis / reopen2: total time (10000 reps) was 2.426701
test2.sjis / reopen3: total time (10000 reps) was 2.873278
----------------------------------------------------------
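
To put "modest" into numbers, the totals above can be expressed
relative to reopen0. The snippet below only rearranges the figures
already listed; on these numbers reopen1 comes out at roughly 1.17
to 1.25 times the reopen0 time, depending on the file.

```python
# Total seconds for 10000 reps, copied from the results above,
# in the order reopen0, reopen1, reopen2, reopen3.
timings = {
    "test.utf8":  [1.188323, 1.490757, 1.766081, 2.141996],
    "test.sjis":  [1.175914, 1.471780, 1.764444, 2.122550],
    "test2.utf8": [1.690255, 1.996235, 2.278798, 2.727867],
    "test2.sjis": [1.841388, 2.147142, 2.426701, 2.873278],
}

# Ratio of each method's time to the reopen0 time for the same file.
ratios = {name: [t / times[0] for t in times]
          for name, times in timings.items()}

for name, r in ratios.items():
    print(name, " ".join("reopen%d=%.2fx" % (i, x) for i, x in enumerate(r)))
```
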
Here is what happens when a test data file is piped
into a program using each of the four methods above:
 $ cat test.utf8 | python3 stdin.py reopen0
 read 102 characters
 $ cat test.utf8 | python3 stdin.py reopen1
 got exception: [Errno 29] Illegal seek
 $ cat test.utf8 | python3 stdin.py reopen2
 read 0 characters
 $ cat test.utf8 | python3 stdin.py reopen3
 read 0 characters
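
stdin.py is not shown here, so the following is only a guess at the
mechanism, reproduced with os.pipe() rather than a real shell pipe.
It assumes the first binary reader pulls the whole (small) input
into its buffer while reading one line, which leaves nothing for a
second reader on the same descriptor, and that a pipe cannot seek.

```python
import io
import os

r_fd, w_fd = os.pipe()
os.write(w_fd, b"# -*- coding: utf-8 -*-\nprint('hi')\n")
os.close(w_fd)                      # writer done, so the reader can hit EOF

first = io.open(r_fd, "rb", closefd=False)
line = first.readline()             # buffers everything, returns one line

try:
    first.seek(0)                   # what reopen1 tries to do
    seek_failed = False
except OSError:                     # pipes are not seekable
    seek_failed = True

second = io.open(r_fd, "rb", closefd=False)
leftover = second.read()            # what a "reopen" of the pipe would see
os.close(r_fd)

print(seek_failed, len(leftover))
```
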
----
[*1] Here is the get_encoding function used above. It is
a toy, simplified reader for Python source encoding lines. Toy,
in that it looks at only one line, doesn't consider a BOM,
etc. Its purpose was to allow me to sanity check the benefits
of having a callable encoding parameter.
    # ENC_PATTERN_B and ENC_PATTERN_S are compiled regexes (bytes and
    # str versions) whose definitions are not shown in this message.
    def get_encoding(line):
        if isinstance(line, bytes):
            nlpos = line.index(b'\n')
            mo = ENC_PATTERN_B.search(line, 0, nlpos)
            if not mo:
                return None
            enc = mo.group(1).decode('latin1')
        else:
            nlpos = line.index('\n')
            mo = ENC_PATTERN_S.search(line, 0, nlpos)
            if not mo:
                return None
            enc = mo.group(1)
        return enc
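
The two patterns are not given above. To run get_encoding as
written, one plausible choice is the regular expression from
PEP 263; that choice is an assumption on my part and may differ
from what was actually benchmarked. The function is repeated in
compact form only so that the block runs on its own.

```python
import re

# Assumed definitions, borrowed from the PEP 263 coding-declaration regex.
_PEP263 = r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)"
ENC_PATTERN_S = re.compile(_PEP263)
ENC_PATTERN_B = re.compile(_PEP263.encode("ascii"))

def get_encoding(line):
    # Search only up to the first newline; index() raises ValueError
    # if the input contains no newline at all.
    is_bytes = isinstance(line, bytes)
    nlpos = line.index(b"\n" if is_bytes else "\n")
    pattern = ENC_PATTERN_B if is_bytes else ENC_PATTERN_S
    mo = pattern.search(line, 0, nlpos)
    if not mo:
        return None
    return mo.group(1).decode("latin1") if is_bytes else mo.group(1)

print(get_encoding(b"# -*- coding: utf-8 -*-\nprint(1)\n"))
print(get_encoding("# -*- coding: shift_jis -*-\n"))
```
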