Python: UTF-8 decode issue when reading a file after seek

I am running into an issue reading a file that contains both UTF-8 and ASCII characters. I use seek to read only part of the data, but I have no way of knowing whether I am starting the read in the middle of a UTF-8 character.

  • osx
  • python 3.6.6

To simplify, my issue can be demonstrated with the following code.

# Write some UTF-8 text to a file (each of these characters encodes to 3 bytes).
with open('/tmp/test.txt', 'w', encoding='utf-8') as f:
    f.write(chr(12345) + chr(23456) + chr(34567) + '\n')

data = open('/tmp/test.txt', encoding='utf-8')
data.read()   # Works fine; just shows that the file can be read as a whole.
data.seek(1)
data.read(1)  # UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
# Seeking in multiples of 3 works, because every character here is 3 bytes long.
data.seek(3)
data.read(1)  # Works fine.

I know I can open the file in binary mode and then read it without issue after seeking to any position. However, I need to process the data as a string, so I end up with the same issue when decoding the bytes into a string.

data = open('/tmp/test.txt', 'rb')
data.seek(1)
z = data.read(3)  # bytes 1..3, starting in the middle of the first character
z.decode()        # hits the same UnicodeDecodeError

Without using seek, I can read the file correctly, even when just calling read(1).

data = open('/tmp/test.txt')
data.tell()   # 0
data.read(1)
data.tell()   # 3, even though only read(1) was called
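
As far as I can tell from the io documentation, in text mode a seek() offset should be either zero or a value previously returned by tell(), which would explain why seek(1) fails while seek(3) works. A small sketch of that, assuming a temporary file instead of /tmp/test.txt (the variable names are only for illustration):

```python
import os
import tempfile

# Assumption: a throwaway temp file stands in for /tmp/test.txt.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "w", encoding="utf-8") as f:
    f.write(chr(12345) + chr(23456) + chr(34567) + "\n")

with open(path, encoding="utf-8") as f:
    offsets = []
    while True:
        pos = f.tell()      # position before the next character
        ch = f.read(1)
        if not ch:          # end of file
            break
        offsets.append(pos)

    # Seeking to a value that tell() returned is safe in text mode.
    f.seek(offsets[2])
    third = f.read(1)

os.remove(path)
```

This only helps when the file has already been read once, so it does not solve the case of jumping to an arbitrary position in a very large file.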

One thing I can think of: after seeking to a location, try to read; on UnicodeDecodeError, set position = position - 1, seek(position), and repeat until the read succeeds.
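
A rough sketch of that back-off idea, assuming the file is valid UTF-8 and is opened in binary mode. Instead of catching UnicodeDecodeError, this variant checks whether the byte at the position is a UTF-8 continuation byte (0x80 to 0xBF) and steps back until it is not. The helper names align_to_char_start and read_chars_at are made up for illustration:

```python
import os
import tempfile


def align_to_char_start(f, pos):
    """Return the largest offset <= pos that is not a UTF-8 continuation byte.

    Assumes f is opened in binary mode and holds valid UTF-8.
    """
    while pos > 0:
        f.seek(pos)
        byte = f.read(1)
        # Continuation bytes have the bit pattern 10xxxxxx (0x80..0xBF).
        if not byte or (byte[0] & 0xC0) != 0x80:
            break
        pos -= 1
    return pos


def read_chars_at(path, pos, count=1):
    """Read `count` characters, starting at the character that contains byte `pos`."""
    with open(path, "rb") as f:
        start = align_to_char_start(f, pos)
        f.seek(start)
        # A UTF-8 character is at most 4 bytes long, so this covers `count` characters.
        chunk = f.read(4 * count)
    # errors="ignore" drops a character that was cut off at the end of the chunk.
    return chunk.decode("utf-8", errors="ignore")[:count]


# Demo with the same three characters as above, written to a temp file.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "w", encoding="utf-8") as f:
    f.write(chr(12345) + chr(23456) + chr(34567) + "\n")

first = read_chars_at(path, 1)    # byte 1 is inside the first character
pair = read_chars_at(path, 4, 2)  # byte 4 is inside the second character
os.remove(path)
```

Note that errors="ignore" would also hide genuinely invalid bytes, so this is only reasonable if the file is known to be valid UTF-8.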

Is there a better (right) way to handle it?

  • Thank you. The UTF-8 encoding range is a big help, so I can "test" whether I am at the right position. I am doing "random" seeks because I need to get a data sample from big files (terabytes in size), for example, 100 random lines from the file. Commented Jul 2, 2018 at 19:57
  • @RuiLi If you're looking for random lines, that's a lot easier; that's why you should always explain your actual problem rather than making people guess at it. Let me update the answer to help more. Commented Jul 2, 2018 at 20:01
  • Thank you for the detailed explanation. Even though my initial question didn't explain how I am using this code, I think it was still worth it: I am learning much more than just counting \n. I understand I can count \n in binary mode, and I am reading whole lines today. Understanding how to deal with UTF-8 will also help me in the future if I do not have the option of looking for a line-break character. Commented Jul 2, 2018 at 22:49
  • @RuiLi Yeah, this is a useful trick to understand. Both UTF-8 and Latin-1-compatible encodings like Windows-1252 encode every pure-ASCII printable and control character with the same byte as ASCII, and never encode anything else to those bytes. This means that you can do things like searching for newlines, header-value-separator ':' characters, etc. This is the only way to parse HTTP headers, Python source files, or anything else where you don't know the encoding until you start reading. Commented Jul 2, 2018 at 22:59
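
A minimal sketch of the random-line idea discussed in these comments, assuming a non-empty UTF-8 file with \n line endings. It works in binary mode because the newline byte 0x0A never appears inside a multi-byte UTF-8 character. The function name random_line is hypothetical, and the sample is not uniform: a line is picked with probability proportional to the length of the line before it.

```python
import os
import random
import tempfile


def random_line(path, rng=random):
    """Return one line found by seeking to a random byte offset.

    Rough sample only: longer preceding lines make a line more likely.
    Assumes the file is non-empty.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(rng.randrange(size))
        f.readline()           # discard the (probably partial) line we landed in
        line = f.readline()    # the next complete line
        if not line:           # we landed in the last line: wrap to the first
            f.seek(0)
            line = f.readline()
    # A complete line always starts and ends on a character boundary.
    return line.decode("utf-8")


# Demo on a small temp file written in binary mode so that line endings stay "\n".
lines = [chr(12345) + chr(23456) + "\n", "abc\n", chr(34567) + "x\n"]
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "wb") as f:
    f.write("".join(lines).encode("utf-8"))

rng = random.Random(0)         # seeded only to make the demo repeatable
samples = [random_line(path, rng) for _ in range(20)]
os.remove(path)
```

For 100 lines, call it 100 times and drop duplicates if needed. Each call reads only the two lines around the offset, which is what makes this workable on very large files.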
