python read file utf-8 decode issue

I am running into an issue reading a file that contains both UTF-8 and ASCII characters. The problem is that I use seek to read only part of the data, and I have no idea whether I am starting to read in the middle of a UTF-8 character.

Environment: macOS, Python 3.6.6

To simplify, my issue can be demonstrated with the following code.

    # write some UTF-8 to a file
    open('/tmp/test.txt', 'w').write(chr(12345)+chr(23456)+chr(34567)+'\n')
    data = open('/tmp/test.txt')
    data.read()   # works fine; just to show I can read the file as a whole
    data.seek(1)
    data.read(1)  # UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
    # seeking in steps of 3 works
    data.seek(3)
    data.read(1)  # works fine

I know I can open the file in binary mode and then read it without issue after seeking to any position. However, I need to process the data as a string, so I will end up with the same issue when I decode it.

    data = open('/tmp/test.txt', 'rb')
    data.seek(1)
    z = data.read(3)
    z.decode()  # hits the same error

Without using seek, I can read it correctly, even when just calling read(1).

    data = open('/tmp/test.txt')
    data.tell()   # 0
    data.read(1)
    data.tell()   # shows 3, even though I called read(1)

One thing I can think of: after seeking to a location, try to read; on UnicodeDecodeError, set position = position - 1 and seek(position) again, until I can read correctly.

Is there a better (right) way to handle it?

Answer

As the documentation explains, when you [`seek`](https://docs.python.org/3/library/io.html#io.TextIOBase.seek) on text files:

> *offset* must either be a number returned by `TextIOBase.tell()`, or zero. Any other offset value produces undefined behaviour.

In practice, what `seek(1)` actually does is seek 1 byte into the file—which puts it in the middle of a character. So, what ends up happening is similar to this:

    >>> s = chr(12345)+chr(23456)+chr(34567)+'\n'
    >>> b = s.encode()
    >>> b
    b'\xe3\x80\xb9\xe5\xae\xa0\xe8\x9c\x87\n'
    >>> b[1:]
    b'\x80\xb9\xe5\xae\xa0\xe8\x9c\x87\n'
    >>> b[1:].decode()
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

So, `seek(3)` happens to work, even though it's not legal, because you happen to be seeking to the start of a character. It's equivalent to this:

    >>> b[3:].decode()
    '宠蜇\n'
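
For contrast, here is a minimal sketch of the documented approach: seek only to values previously returned by `tell()`. The in-memory `io.BytesIO` buffer and the sample text are assumptions made so the snippet runs on its own; with a real file you would use `open(path, encoding='utf-8')` instead.

```python
import io

text = '\u3039\u5ba0\u8707\nsecond line\n'
# Assumption: wrap an in-memory byte buffer so the sketch needs no file on disk.
f = io.TextIOWrapper(io.BytesIO(text.encode('utf-8')), encoding='utf-8')

offsets = []
while True:
    pos = f.tell()        # opaque value; only meaningful when passed back to seek()
    line = f.readline()
    if not line:
        break
    offsets.append(pos)   # remember where each line starts

f.seek(offsets[1])        # legal: this value came from tell()
second = f.readline()
```

Note that `tell()` is disabled while iterating with `for line in f`, which is why this sketch calls `readline()` in a loop.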

---

If you want to rely on that undocumented behavior to try to seek randomly into the middle of a UTF-8 text file, you can usually get away with it by doing what you suggested. For example:

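    def readchar(f, pos):
        # f is a text-mode file; try pos and the next few byte offsets.
        for i in range(pos, pos+5):
            try:
                f.seek(i)
                return f.read(1)
            except UnicodeDecodeError:
                pass
        # UnicodeDecodeError itself needs five constructor arguments, so raise
        # its base class ValueError with a plain message instead.
        raise ValueError('Unable to find a UTF-8 start byte')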

Or you could use knowledge of [the UTF-8 encoding](https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences) to manually scan for a valid start byte in a binary file:

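    def readchar(f, pos):
        # f must be opened in binary mode ('rb').
        f.seek(pos)
        for _ in range(5):
            byte = f.read(1)
            if not byte:
                break  # reached end of file
            # Continuation bytes are 0x80-0xBF; any other value starts a character.
            if byte[0] < 0x80 or byte[0] >= 0xC0:
                return byte  # the start byte; f is now positioned just after it
        raise ValueError('Unable to find a UTF-8 start byte')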
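
If you want the whole character rather than just its start byte, one possible sketch is to back up from the offset to the lead byte and use its high bits to work out how many bytes to decode. The name `read_char_at` and the returned `(start, char)` pair are made up for illustration, and the file is assumed to be valid UTF-8 opened in binary mode.

```python
import io

def read_char_at(f, pos):
    """Return (start, char) for the character covering byte offset pos."""
    # Back up over at most 3 continuation bytes (0b10xxxxxx) to the lead byte.
    for start in range(pos, max(pos - 4, -1), -1):
        f.seek(start)
        lead = f.read(1)
        if not lead:
            raise ValueError('offset is past the end of the file')
        if lead[0] & 0xC0 != 0x80:
            break
    else:
        raise ValueError('no UTF-8 start byte within 4 bytes')
    # The lead byte's high bits give the total length of the sequence.
    first = lead[0]
    if first < 0x80:
        length = 1
    elif first >> 5 == 0b110:
        length = 2
    elif first >> 4 == 0b1110:
        length = 3
    elif first >> 3 == 0b11110:
        length = 4
    else:
        raise ValueError('invalid UTF-8 lead byte')
    f.seek(start)
    return start, f.read(length).decode('utf-8')

# Hypothetical usage with the sample data from the question.
f = io.BytesIO('\u3039\u5ba0\u8707\n'.encode('utf-8'))
start, char = read_char_at(f, 4)   # offset 4 is in the middle of the second character
```

Backing up rather than scanning forward returns the character that actually contains the offset, which is closer to the "position - 1" idea from the question.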

---

However, if you're actually just looking for the next complete line before or after some arbitrary point, that's a whole lot easier.

In UTF-8, the newline character is encoded as a single byte, and the same byte as in ASCII—that is, `'\n'` encodes to `b'\n'`. (If you have Windows-style endings, the same is true for return, so `'\r\n'` also encodes to `b'\r\n'`.) This is by design, to make it easier to handle this kind of problem. 

So, if you open the file in binary mode, you can seek forward or backward until you find a newline byte. And then, you can just use the (binary-file) `readline` method to read from there until the next newline.

The exact details depend on exactly what rule you want to use here. Also, I'm going to show a stupid, completely unoptimized version that reads a byte at a time; in real life you probably want to back up, read, and scan (e.g., with `rfind`), say, 80 bytes at a time, but this is hopefully simpler to understand:

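    def getline(f, pos, maxpos):
        # f must be opened in binary mode; maxpos is unused in this simple version.
        # Scan backward from pos-1 for the newline that ends the previous line.
        for start in range(pos-1, -1, -1):
            f.seek(start)
            if f.read(1) == b'\n':
                break  # the file position is now just past that newline
        else:
            f.seek(0)  # no earlier newline: the line starts at offset 0
        return f.readline().decode()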

Here it is in action:

    >>> import io
    >>> s = ''.join(f'{i}:\u3039\u5ba0\u8707\n' for i in range(5))
    >>> b = s.encode()
    >>> f = io.BytesIO(b)
    >>> maxlen = len(b)
    >>> print(getline(f, 0, maxlen), end='')
    0:〹宠蜇
    >>> print(getline(f, 1, maxlen), end='')
    0:〹宠蜇
    >>> print(getline(f, 10, maxlen), end='')
    0:〹宠蜇
    >>> print(getline(f, 11, maxlen), end='')
    0:〹宠蜇
    >>> print(getline(f, 12, maxlen), end='')
    1:〹宠蜇
    >>> print(getline(f, 59, maxlen), end='')
    4:〹宠蜇
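
As a rough sketch of the chunked variant mentioned above, the function below backs up a block at a time and uses `rfind` to locate the previous newline; `sample_lines` then uses it to pull lines at random byte offsets. Both names, the 4096-byte default, and the `size` parameter are assumptions for illustration. Note that picking random offsets favors longer lines and can return the same line twice, so this is not a uniform sample.

```python
import io
import random

def line_at(f, pos, chunk=4096):
    """Return the decoded line containing byte offset pos of a UTF-8 binary file."""
    end = pos
    start = 0  # used if no newline is found before pos
    while end > 0:
        begin = max(0, end - chunk)
        f.seek(begin)
        block = f.read(end - begin)
        idx = block.rfind(b'\n')      # last newline before the current end
        if idx != -1:
            start = begin + idx + 1   # the line starts just after that newline
            break
        end = begin                   # keep scanning the previous block
    f.seek(start)
    return f.readline().decode('utf-8')

def sample_lines(f, size, k, rng=random):
    """Return k lines picked via random byte offsets in [0, size)."""
    return [line_at(f, rng.randrange(size)) for _ in range(k)]

# Hypothetical usage with the same in-memory data as the demo above.
s = ''.join(f'{i}:\u3039\u5ba0\u8707\n' for i in range(5))
b = s.encode('utf-8')
f = io.BytesIO(b)
picked = sample_lines(f, len(b), 3)
```

For a real file, `size` could come from `os.path.getsize(path)`, with the file opened in `'rb'` mode.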


Thank you. The UTF-8 encoding range is a big help, so I can test whether I am at the right position. I am doing "random" seeks because I need to sample data from big files (sizes in TB), for example, 100 random lines from a file.

– Rui Li, Jul 2, 2018 at 19:57

@RuiLi If you're looking for random lines, that's a lot easier; that's why you should always explain your actual problem rather than making people guess at it. Let me update the answer to help more.

– abarnert, Jul 2, 2018 at 20:01

Thank you for the detailed explanation. Even though my initial question didn't explain how I am using this code, I think it was still worth asking. I am learning much more than counting \n. I understand that I can count \n in binary mode, and I am reading whole lines today. Understanding how to deal with UTF-8 will also help me in the future if I don't have the option of looking for a line-break character.

– Rui Li, Jul 2, 2018 at 22:49

@RuiLi Yeah, this is a useful trick to understand. Both UTF-8 and Latin-1-compatible encodings like Windows-1252 encode every pure-ASCII printable and control character with the same byte as ASCII, and never encode anything else to those bytes. This means that you can do things like searching for newlines, header-value-separator `:` characters, etc. This is the only way to parse HTTP headers, Python source files, or anything else where you don't know the encoding until you start reading.

– abarnert, Jul 2, 2018 at 22:59
