I am running into an issue with reading a file that has UTF8 and ASCII character. The problem is I am using seek to only read some part of the data, but I have no idea if I am "read" in the "middle" of an UTF8.
- osx
- python 3.6.6
to simply it, my issue can demoed with following code.
# write some utf-8 to a file
open('/tmp/test.txt', 'w').write(chr(12345)+chr(23456)+chr(34567)+'\n')
data = open('/tmp/test.txt')
data.read() # this works fine. to just demo I can read the file as whole
data.seek(1)
data.read(1) # UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
# I can read by seek 3 by 3
data.seek(3)
data.read(1) # this works fine.
I know I can open the file in binary then read it without issue by seeking to any position, however, I need to process the string, so I will end up with same issue when decode into string.
data = open('/tmp/test.txt', 'rb')
data.seek(1)
z = data.seek(3)
z.decode() # will hit same error
without using seek, I can read it correctly even just calling read(1).
data = open('/tmp/test.txt')
data.tell() # 0
data.read(1)
data.tell() # shows 3 even calling read(1)
one thing I can think is after seek to a location, try to read, on UnicodeDecodeError, position = position -1, seek(position), until I can read it correctly.
Is there a better (right) way to handle it?
-
3Randomly reading bytes will indeed not make things UTF-8. What is your actual goal, that you're trying to do that? If you want to step forward or backward some n code points, you'll probably need to scan the full bytes anyway (there might be a package doing that for you).9769953– 97699532018年07月02日 19:07:01 +00:00Commented Jul 2, 2018 at 19:07
1 Answer 1
As the documentation explains, when you seek on text files:
offset must either be a number returned by
TextIOBase.tell(), or zero. Any other offset value produces undefined behaviour.
In practice, what seek(1) actually does is seek 1 byte into the file—which puts it in the middle of a character. So, what ends up happening is similar to this:
>>> s = chr(12345)+chr(23456)+chr(34567)+'\n'
>>> b = s.encode()
>>> b
b'\xe3\x80\xb9\xe5\xae\xa0\xe8\x9c\x87\n'
>>> b[1:]
b'x80\xb9\xe5\xae\xa0\xe8\x9c\x87\n'
>>> b[1:].decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 3: invalid start byte
So, seek(3) happens to work, even though it's not legal, because you happen to be seeking to the start of a character. It's equivalent to this:
>>> b[3:].decode()
'宠蜇\n'
If you want to rely on that undocumented behavior to try to seek randomly into the middle of a UTF-8 text file, you can usually get away with it by doing what you suggested. For example:
def readchar(f, pos):
for i in range(pos:pos+5):
try:
f.seek(i)
return f.read(1)
except UnicodeDecodeError:
pass
raise UnicodeDecodeError('Unable to find a UTF-8 start byte')
Or you could use knowledge of the UTF-8 encoding to manually scan for a valid start byte in a binary file:
def readchar(f, pos):
f.seek(pos)
for _ in range(5):
byte = f.read(1)
if byte in range(0, 0x80) or byte in range(0xC0, 0x100):
return byte
raise UnicodeDecodeError('Unable to find a UTF-8 start byte')
However, if you're actually just looking for the next complete line before or after some arbitrary point, that's a whole lot easier.
In UTF-8, the newline character is encoded as a single byte, and the same byte as in ASCII—that is, '\n' encodes to b'\n'. (If you have Windows-style endings, the same is true for return, so '\r\n' also encodes to b'\r\n'.) This is by design, to make it easier to handle this kind of problem.
So, if you open the file in binary mode, you can seek forward or backward until you find a newline byte. And then, you can just use the (binary-file) readline method to read from there until the next newline.
The exact details depend on exactly what rule you want to use here. Also, I'm going to show a stupid, completely unoptimized version that reads a character at a time; in real life you probably want to back up, read, and scan (e.g., with rfind), say, 80 characters at a time, but this is hopefully simpler to understand:
def getline(f, pos, maxpos):
for start in range(pos-1, -1, -1):
f.seek(start)
if f.read(1) == b'\n':
break
else:
f.seek(0)
return f.readline().decode()
Here it is in action:
>>> s = ''.join(f'{i}:\u3039\u5ba0\u8707\n' for i in range(5))
>>> b = s.encode()
>>> f = io.BytesIO(b)
>>> maxlen = len(b)
>>> print(getline(f, 0, maxlen))
0:卄宠蜇
>>> print(getline(f, 1, maxlen))
0:卄宠蜇
>>> print(getline(f, 10, maxlen))
0:卄宠蜇
>>> print(getline(f, 11, maxlen))
0:卄宠蜇
>>> print(getline(f, 12, maxlen))
1:卄宠蜇
>>> print(getline(f, 59, maxlen))
4:卄宠蜇
4 Comments
: characters, etc. This is the only way to parse HTTP headers, Python source files, or anything else where you don't know the encoding until you start reading.