I want to open a Persian-language text file in Python with the code below:

 import codecs
 lines = []
 for line in codecs.open('0001.txt', encoding='UTF-8'):
     lines.append(line)

but it gives me this error:

> Traceback (most recent call last):
 File "/usr/lib/pycharm-community/helpers/pydev/pydevd.py", line 1596, in <module>
 globals = debugger.run(setup['file'], None, None, is_module)
 File "/usr/lib/pycharm-community/helpers/pydev/pydevd.py", line 974, in run
 pydev_imports.execfile(file, globals, locals) # execute the script
 File "/usr/lib/pycharm-community/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
 exec(compile(contents+"\n", file, 'exec'), glob, loc)
 File "/home/nlpuser/Documents/ms/Work/General_Dataset_creator/BijanKhanReader.py", line 24, in <module>
 for lin in codecs.open('corpuses/markaz/0001.txt',encoding='UTF-8'):
 File "/home/nlpuser/anaconda3/envs/tmpy36/lib/python3.6/codecs.py", line 713, in __next__
 return next(self.reader)
 File "/home/nlpuser/anaconda3/envs/tmpy36/lib/python3.6/codecs.py", line 644, in __next__
 line = self.readline()
 File "/home/nlpuser/anaconda3/envs/tmpy36/lib/python3.6/codecs.py", line 557, in readline
 data = self.read(readsize, firstline=True)
 File "/home/nlpuser/anaconda3/envs/tmpy36/lib/python3.6/codecs.py", line 503, in read
 newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe3 in position 0: invalid continuation byte

What is wrong with this code?

And this is the output of the file command for this file:

0001.txt: Non-ISO extended-ASCII text, with CRLF line terminators

asked Jun 25, 2018 at 10:40

1 Answer

UTF-8 has a very specific format, given that a character can be represented by anywhere from one to four bytes.

If a character is single-byte, it will be represented by a byte in the range 0x00 to 0x7F. If it is represented by two or more bytes, the lead byte will be in the range 0xC2 to 0xF4, followed by one to three continuation bytes, each in the range 0x80 to 0xBF.
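The byte ranges above can be written down as a small classifier. This is only an illustrative sketch: classify_utf8_byte is a made-up helper name, and it looks at each byte in isolation, so it does not check that a lead byte is followed by the right number of continuation bytes.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte value by the role it can play in UTF-8."""
    if b <= 0x7F:
        return "single-byte (ASCII)"
    if 0x80 <= b <= 0xBF:
        return "continuation"
    if 0xC2 <= b <= 0xF4:
        return "lead"
    # 0xC0, 0xC1 and 0xF5-0xFF never appear in valid UTF-8.
    return "invalid"


# U+0645 (ARABIC LETTER MEEM) takes two bytes in UTF-8:
# one lead byte followed by one continuation byte.
roles = [classify_utf8_byte(b) for b in "\u0645".encode("utf-8")]
print(roles)
```

Note that 0xE3, the byte from the error message, falls in the lead-byte range.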

In your case, Python found the byte 0xE3 at position 0. On its own, 0xE3 is a legal lead byte that announces a three-byte sequence, but the byte that follows it is not a legal continuation byte (it is outside 0x80 to 0xBF), which is what "invalid continuation byte" means. The problem is likely in your text file, not in your program: the file is either corrupted or, more likely, stored in an encoding other than UTF-8.
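The same message can be reproduced without the file. The two bytes below are an assumption chosen for illustration, not the actual contents of 0001.txt: 0xE3 opens a three-byte sequence, and 0xE4 is not in the continuation range. Python reports the position where the undecodable sequence starts.

```python
# Hypothetical sample bytes: a lead byte followed by a non-continuation byte.
data = b"\xe3\xe4"

message = None
start = None
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    message = str(exc)   # human-readable description of the failure
    start = exc.start    # index of the first byte of the bad sequence

print(message)
print(start)
```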

Use hexdump -C <file> or xxd <file> to see the exact sequence of bytes you have, and file <file> to try to guess the encoding. With that output we might be able to say more.
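If neither tool is at hand, a similar check can be done from Python by reading the file in binary mode, which skips decoding entirely. A minimal sketch, where preview_bytes is a hypothetical helper name:

```python
from pathlib import Path


def preview_bytes(path, count=16):
    """Return the first `count` bytes of a file as space-separated hex."""
    raw = Path(path).read_bytes()[:count]
    return " ".join(f"{b:02x}" for b in raw)
```

Calling preview_bytes("0001.txt") would show the leading bytes of the file. Many bytes at or above 0x80 that do not form lead/continuation sequences suggest a single-byte encoding rather than UTF-8.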

not2qubit
answered Jun 25, 2018 at 10:50

5 Comments

this is the output for file : 0001.txt: Non-ISO extended-ASCII text, with CRLF line terminators
... in other words, not UTF-8.
@Amadan so what is it?
How would I know? You haven’t posted the file. All I can tell you is that it’s not UTF-8.
I (and Google Chrome) believe it is Windows-1256. In Ruby, puts File.read("0002.txt", encoding: Encoding::CP1256).encode(Encoding::UTF_8) should give you something useful.
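A Python equivalent might look like the sketch below. It assumes the commenter's guess is right and the file really is Windows-1256 (Python's codec name is "cp1256"). If the guess is wrong, the text will come out garbled rather than raise an error. read_lines is a hypothetical helper name.

```python
def read_lines(path, encoding="cp1256"):
    """Read a text file with the given encoding and return its lines.

    In text mode, open() translates CRLF line endings to "\n" by default,
    so only "\n" needs to be stripped here.
    """
    with open(path, encoding=encoding) as f:
        return [line.rstrip("\n") for line in f]
```

With this, lines = read_lines("0001.txt") would replace the codecs.open loop from the question.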
