python read file utf-8 decode issue

Question

I am running into an issue reading a file that contains both UTF-8 and ASCII characters. The problem is that I use seek to read only part of the data, but I have no way of knowing whether I am starting to read in the middle of a UTF-8 character.

osx
python 3.6.6

To simplify, my issue can be demonstrated with the following code.

# write some UTF-8 to a file
open('/tmp/test.txt', 'w').write(chr(12345)+chr(23456)+chr(34567)+'\n')
data = open('/tmp/test.txt')
data.read() # this works fine; it just shows I can read the file as a whole
data.seek(1)
data.read(1) # UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
# seeking in steps of 3 works
data.seek(3)
data.read(1) # this works fine.

I know I can open the file in binary mode and then seek to any position and read without issue. However, I need to process the data as a string, so I end up with the same issue when I decode it.

data = open('/tmp/test.txt', 'rb')
data.seek(1)
z = data.read(3)
z.decode() # will hit the same error

Without using seek, I can read it correctly, even when just calling read(1).

data = open('/tmp/test.txt')
data.tell() # 0
data.read(1)
data.tell() # shows 3 even though I only called read(1)

One thing I can think of: after seeking to a location, try to read; on UnicodeDecodeError, decrement the position by 1 and seek again, repeating until the read succeeds.

Is there a better (right) way to handle it?

Comment

Randomly reading bytes will indeed not make things UTF-8. What is your actual goal, that you're trying to do that? If you want to step forward or backward some n code points, you'll probably need to scan the full bytes anyway (there might be a package doing that for you).

Answer

As the documentation explains, when you seek on text files:

offset must either be a number returned by TextIOBase.tell(), or zero. Any other offset value produces undefined behaviour.
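
To illustrate the documented rule, here is a minimal sketch of the supported pattern: record the values that tell() returns while reading, then pass only those values back to seek(). The temporary file and the `offsets` list are illustrative assumptions, not part of the question's setup.

```python
import os
import tempfile

text = chr(12345) + chr(23456) + chr(34567) + "\n"

# Write the sample text to a temporary file (the path is arbitrary).
with tempfile.NamedTemporaryFile("w", encoding="utf-8", delete=False) as tmp:
    tmp.write(text)
    path = tmp.name

offsets = []
with open(path, encoding="utf-8", newline="") as f:
    while True:
        pos = f.tell()          # opaque cookie, only meant to be handed to seek()
        ch = f.read(1)
        if not ch:
            break
        offsets.append((pos, ch))

    # Seeking to a recorded cookie always lands on a character boundary.
    for pos, ch in reversed(offsets):
        f.seek(pos)
        assert f.read(1) == ch

os.remove(path)
```

This only helps if you have already read through the file once, which is why it is not a solution for jumping to arbitrary positions in a very large file.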

In practice, what seek(1) actually does is seek 1 byte into the file—which puts it in the middle of a character. So, what ends up happening is similar to this:

>>> s = chr(12345)+chr(23456)+chr(34567)+'\n'
>>> b = s.encode()
>>> b
b'\xe3\x80\xb9\xe5\xae\xa0\xe8\x9c\x87\n'
>>> b[1:]
b'\x80\xb9\xe5\xae\xa0\xe8\x9c\x87\n'
>>> b[1:].decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

So, seek(3) happens to work, even though it's not legal, because you happen to be seeking to the start of a character. It's equivalent to this:

>>> b[3:].decode()
'宠蜇\n'

If you want to rely on that undocumented behavior to try to seek randomly into the middle of a UTF-8 text file, you can usually get away with it by doing what you suggested. For example:

def readchar(f, pos):
    # f is a text-mode file; try each offset until one decodes cleanly.
    for i in range(pos, pos+5):
        try:
            f.seek(i)
            return f.read(1)
        except UnicodeDecodeError:
            pass
    raise UnicodeDecodeError('utf-8', b'', 0, 1, 'Unable to find a UTF-8 start byte')

Or you could use knowledge of the UTF-8 encoding to manually scan for a valid start byte in a binary file:

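def readchar(f, pos):
    # f is a binary-mode file; read(1) returns a bytes object of length 1.
    f.seek(pos)
    for _ in range(5):
        byte = f.read(1)
        # Start bytes are 0x00-0x7F or 0xC0-0xFF; 0x80-0xBF are continuation bytes.
        if byte and (byte[0] < 0x80 or byte[0] >= 0xC0):
            return byte
    raise UnicodeDecodeError('utf-8', b'', 0, 1, 'Unable to find a UTF-8 start byte')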
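
The start-byte test can be taken one step further to return the whole character rather than just its first byte. The following is only a sketch under the assumption that the file holds valid UTF-8; `char_at` is a hypothetical helper name. It steps backward over continuation bytes (those of the form 0b10xxxxxx) and then uses the lead byte to decide how many bytes to decode.

```python
import io

def char_at(f, pos):
    """Return the UTF-8 character covering byte offset pos.

    f must be a binary-mode file containing valid UTF-8.
    """
    # A character has at most 3 continuation bytes, so look at pos and
    # up to 3 bytes before it for a byte that is not 0b10xxxxxx.
    for start in range(pos, max(pos - 4, -1), -1):
        f.seek(start)
        lead = f.read(1)
        if not lead:
            raise ValueError("offset is at or past the end of the file")
        if lead[0] & 0xC0 != 0x80:
            break
    else:
        raise ValueError("no UTF-8 start byte found near this offset")

    # The lead byte encodes the total length of the character.
    if lead[0] < 0x80:
        length = 1
    elif lead[0] >= 0xF0:
        length = 4
    elif lead[0] >= 0xE0:
        length = 3
    else:
        length = 2

    f.seek(start)
    return f.read(length).decode("utf-8")

sample = io.BytesIO((chr(12345) + chr(23456) + chr(34567) + "\n").encode("utf-8"))
# Offsets 3, 4 and 5 all fall inside the second character.
assert char_at(sample, 4) == chr(23456)
```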

However, if you're actually just looking for the next complete line before or after some arbitrary point, that's a whole lot easier.

In UTF-8, the newline character is encoded as a single byte, and the same byte as in ASCII—that is, '\n' encodes to b'\n'. (If you have Windows-style endings, the same is true for return, so '\r\n' also encodes to b'\r\n'.) This is by design, to make it easier to handle this kind of problem.

So, if you open the file in binary mode, you can seek forward or backward until you find a newline byte. And then, you can just use the (binary-file) readline method to read from there until the next newline.

The exact details depend on exactly what rule you want to use here. Also, I'm going to show a stupid, completely unoptimized version that reads a byte at a time; in real life you probably want to back up, read, and scan (e.g., with rfind), say, 80 bytes at a time, but this is hopefully simpler to understand:

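def getline(f, pos, maxpos):
    # f is a binary-mode file. Walk backward from pos looking for a newline.
    for start in range(pos-1, -1, -1):
        f.seek(start)
        if f.read(1) == b'\n':
            break  # the file position is now just past that newline
    else:
        # No newline before pos: the line starts at the beginning of the file.
        f.seek(0)
    return f.readline().decode()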

Here it is in action:

>>> import io
>>> s = ''.join(f'{i}:\u3039\u5ba0\u8707\n' for i in range(5))
>>> b = s.encode()
>>> f = io.BytesIO(b)
>>> maxlen = len(b)
>>> print(getline(f, 0, maxlen))
0:〹宠蜇
>>> print(getline(f, 1, maxlen))
0:〹宠蜇
>>> print(getline(f, 10, maxlen))
0:〹宠蜇
>>> print(getline(f, 11, maxlen))
0:〹宠蜇
>>> print(getline(f, 12, maxlen))
1:〹宠蜇
>>> print(getline(f, 59, maxlen))
4:〹宠蜇
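
For comparison, the chunked variant mentioned earlier might look roughly like this. It is a sketch rather than a tuned implementation: `getline_chunked` and its `chunk_size` parameter are hypothetical names, and it follows the same rule as the byte-at-a-time version (the line starts just after the last newline strictly before `pos`).

```python
import io

def getline_chunked(f, pos, chunk_size=4096):
    """Return the decoded line containing byte offset pos (binary-mode file)."""
    end = pos
    line_start = 0                      # used if no newline precedes pos
    while end > 0:
        begin = max(0, end - chunk_size)
        f.seek(begin)
        chunk = f.read(end - begin)     # bytes in [begin, end)
        idx = chunk.rfind(b"\n")
        if idx != -1:
            line_start = begin + idx + 1
            break
        end = begin                     # keep scanning further back
    f.seek(line_start)
    return f.readline().decode("utf-8")

demo = io.BytesIO("".join(f"{i}:\u3039\u5ba0\u8707\n" for i in range(5)).encode("utf-8"))
# Each line is 12 bytes long, so offset 12 is the first byte of line 1.
assert getline_chunked(demo, 12, chunk_size=5).startswith("1:")
```

Because rfind only ever looks for the single byte b"\n", a chunk boundary that splits a multi-byte character is harmless; decoding happens only on the complete line.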

Comment

Thank you. The UTF-8 encoding ranges are a big help, so I can test whether I am at a valid position. I am doing random seeks because I need to sample data from big files (terabytes in size), for example 100 random lines from a file.

Comment

@RuiLi If you're looking for random lines, that's a lot easier; that's why you should always explain your actual problem rather than making people guess at it. Let me update the answer to help more.
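
One possible shape for the random-lines case, sketched under some assumptions: the file is opened in binary mode, lines end in b"\n", and `sample_lines` is a hypothetical name. It jumps to a random byte offset, discards the rest of whatever line it landed in, and returns the next full line. Note that this picks a line with probability proportional to the length of the line before it and may return the same line twice, so it is a quick sampler rather than a uniform one.

```python
import io
import os
import random

def sample_lines(f, k, rng=random):
    """Return k lines chosen by seeking to random byte offsets in binary file f."""
    size = f.seek(0, os.SEEK_END)       # seek() returns the new absolute position
    if size == 0:
        return []
    lines = []
    for _ in range(k):
        f.seek(rng.randrange(size))
        f.readline()                    # discard the (probably partial) current line
        line = f.readline()
        if not line:                    # landed in the last line: wrap to the first
            f.seek(0)
            line = f.readline()
        lines.append(line.decode("utf-8"))
    return lines

big = io.BytesIO("".join(f"{i}:\u3039\u5ba0\u8707\n" for i in range(5)).encode("utf-8"))
picked = sample_lines(big, 3, rng=random.Random(0))
```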

Comment

Thank you for the detailed explanation. Even though my initial question didn't explain how I am using this code, I think it was still worth asking: I am learning much more than how to count \n. I understand that I can look for \n in binary mode, and I am reading whole lines today. Understanding how to deal with UTF-8 will also help me in the future if I don't have the option of looking for a line-break character.

Comment

@RuiLi Yeah, this is a useful trick to understand. Both UTF-8 and Latin-1-compatible encodings like Windows-1252 encode every pure-ASCII printable and control character with the same byte as ASCII, and never encode anything else to those bytes. This means that you can do things like searching for newlines, header-value-separator : characters, etc. This is the only way to parse HTTP headers, Python source files, or anything else where you don't know the encoding until you start reading.
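
A small sketch of that trick, with `split_header` as a hypothetical helper: the header line is split at the ASCII bytes b"\n" and b":" before anything is decoded, and the same code works whether the value was encoded as UTF-8 or as Windows-1252.

```python
def split_header(raw):
    """Split b'Name: value\\n...' at ASCII delimiters without decoding first."""
    line, _, _rest = raw.partition(b"\n")
    name, _, value = line.partition(b":")
    return name.strip(), value.strip()

for encoding in ("utf-8", "cp1252"):
    raw = "X-Name: caf\u00e9\nbody".encode(encoding)
    name, value = split_header(raw)
    assert name == b"X-Name"
    # Only the value needs the real encoding; the delimiters did not.
    assert value.decode(encoding) == "caf\u00e9"
```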

abarnert · Accepted Answer · 2018-07-02 19:12:55Z

