UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe3 in position 34: invalid continuation byte

Question 1

I want to open a Persian-language text file in Python with the code below:

    import codecs
    lines = []
    for line in codecs.open('0001.txt', encoding='UTF-8'):
        lines.append(line)

but it gives me this error:

Traceback (most recent call last):
  File "/usr/lib/pycharm-community/helpers/pydev/pydevd.py", line 1596, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/usr/lib/pycharm-community/helpers/pydev/pydevd.py", line 974, in run
    pydev_imports.execfile(file, globals, locals) # execute the script
  File "/usr/lib/pycharm-community/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/nlpuser/Documents/ms/Work/General_Dataset_creator/BijanKhanReader.py", line 24, in <module>
    for lin in codecs.open('corpuses/markaz/0001.txt',encoding='UTF-8'):
  File "/home/nlpuser/anaconda3/envs/tmpy36/lib/python3.6/codecs.py", line 713, in __next__
    return next(self.reader)
  File "/home/nlpuser/anaconda3/envs/tmpy36/lib/python3.6/codecs.py", line 644, in __next__
    line = self.readline()
  File "/home/nlpuser/anaconda3/envs/tmpy36/lib/python3.6/codecs.py", line 557, in readline
    data = self.read(readsize, firstline=True)
  File "/home/nlpuser/anaconda3/envs/tmpy36/lib/python3.6/codecs.py", line 503, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe3 in position 0: invalid continuation byte

What is wrong with this code?

And this is the output of the file command:

0001.txt: Non-ISO extended-ASCII text, with CRLF line terminators

Answer

UTF-8 has a very specific format, given that a character can be represented by anywhere from one to four bytes.

If a character is single-byte, it will be represented by a byte in the range 0x00 to 0x7F. If it is represented by two or more bytes, the leading byte will be in the range 0xC2 to 0xF4, followed by one to three continuation bytes, each in the range 0x80 to 0xBF.
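As a rough illustration of the ranges above, the following Python sketch classifies individual byte values. The helper name classify_utf8_byte is made up for this example and is not part of any library. It looks at one byte at a time, so it does not validate whole sequences.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify a single byte value by the role it can play in UTF-8."""
    if 0x00 <= b <= 0x7F:
        return "single"        # ASCII: a complete character on its own
    if 0x80 <= b <= 0xBF:
        return "continuation"  # may only follow a lead byte
    if 0xC2 <= b <= 0xF4:
        return "lead"          # starts a two-, three- or four-byte sequence
    return "invalid"           # 0xC0, 0xC1 and 0xF5-0xFF never occur in UTF-8


# The letter meem (U+0645) takes two bytes in UTF-8: a lead and a continuation.
encoded = "\u0645".encode("utf-8")
roles = [classify_utf8_byte(b) for b in encoded]
print(encoded.hex(), roles)

# The byte from the error message falls in the lead-byte range.
print(classify_utf8_byte(0xE3))
```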

In your case, Python found the byte 0xE3, which is a legal lead byte for a three-byte sequence, but the byte after it is not a legal continuation byte (the error message reports the byte at which the invalid sequence starts). The problem is likely in your text file, not in your program: either the file is badly encoded, or it is in a different encoding from the one you specified.
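To see the message in isolation, here is a minimal sketch that decodes two hypothetical bytes, 0xE3 followed by 0xE4. They are chosen for illustration and are not taken from the asker's file. 0xE3 is in the lead-byte range, 0xE4 is not in the continuation range, and the exception records where the broken sequence starts.

```python
sample = b"\xe3\xe4"  # hypothetical bytes, not read from the real file

try:
    sample.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

if error is not None:
    print(error.reason)              # why decoding stopped
    print(error.start)               # index where the bad sequence begins
    print(hex(sample[error.start]))  # the byte named in the message
```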

Use hexdump -C <file> or xxd <file> to check exactly which bytes you have, and file <file> to try to guess the encoding, and we might be able to say more.
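If those tools are not at hand, much the same inspection can be done from Python by reading the file in binary mode. The sketch below is only an approximation of what hexdump shows. The helper names first_utf8_error and show_bytes_around are invented for this example, and the data is a made-up byte string standing in for the real file.

```python
def first_utf8_error(data: bytes):
    """Return the index of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


def show_bytes_around(data: bytes, position: int, context: int = 8) -> str:
    """Return the bytes near `position` as space-separated hex pairs."""
    start = max(0, position - context)
    end = min(len(data), position + context + 1)
    return " ".join(f"{b:02x}" for b in data[start:end])


# Stand-in data; for a real file use: data = open("0001.txt", "rb").read()
data = b"abc \xe3\xe4 xyz"
position = first_utf8_error(data)
if position is not None:
    print(position, show_bytes_around(data, position, context=2))
```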

Comment 1

This is the output of file: 0001.txt: Non-ISO extended-ASCII text, with CRLF line terminators

Comment 2

... in other words, not UTF-8.

Comment 3

@Amadan So what is it?

Comment 4

How would I know? You haven’t posted the file. All I can tell you is that it’s not UTF-8.

Comment 5

I (and Google Chrome) believe it is Windows-1256. puts File.read("0002.txt", encoding: Encoding::CP1256).encode(Encoding::UTF_8) should give you something useful.
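The one-liner in that comment is Ruby. A Python counterpart might look like the sketch below, assuming the Windows-1256 guess is right (Python's name for that codec is cp1256). The function name read_lines and the temporary sample file are made up so that the example runs on its own; the sample text is not the asker's data.

```python
import os
import tempfile


def read_lines(path, encoding="cp1256"):
    """Read a text file in the given encoding and return its lines without line endings."""
    # Text mode with default newline handling turns CRLF into "\n" while reading.
    with open(path, encoding=encoding) as handle:
        return [line.rstrip("\n") for line in handle]


# Build a small stand-in file: two Persian words, Windows-1256 bytes, CRLF endings.
raw = "سلام\r\nمن\r\n".encode("cp1256")
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(raw)
    sample_path = tmp.name

try:
    lines = read_lines(sample_path)
finally:
    os.remove(sample_path)

print(lines)
```

For the real file the call would be read_lines('0001.txt'). If the result still looks garbled, the encoding guess is probably wrong and another codec should be tried.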

Amadan · Accepted Answer · 2018-06-25 10:50:00Z
