UnicodeDecodeError while loading file in python

Asked 9 years, 5 months ago

Viewed 958 times

I'm running this:

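from sklearn.datasets import load_mlcomp
from sklearn.feature_extraction.text import TfidfVectorizer

news_train = load_mlcomp('20news-18828', 'train')
vectorizer = TfidfVectorizer(encoding='latin1')
X_train = vectorizer.fit_transform((open(f, errors='ignore').read()
                                    for f in news_train.filenames))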

but I get UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 39: invalid continuation byte, on the line that calls open().

I checked the news_train.filenames. It is:

array(['/Users/juby/Downloads/mlcomp/379/train/sci.med/12836-58920',
 ..., '/Users/juby/Downloads/mlcomp/379/train/sci.space/14129-61228'], 
 dtype='<U74')

The paths look correct. It may be related to the dtype or to my environment (I'm on Mac OS X 10.11), but I haven't been able to fix it after many attempts. Thank you!

P.S. The code is from an ML tutorial: http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html#example-text-mlcomp-sparse-document-classification-py


asked Jul 29, 2016 at 17:16


Denly

959 reputation, 1 gold badge, 11 silver badges, 21 bronze badges


Python 3? Try open(f, mode='rb', errors='ignore').

– Philip Tzou

Commented Jul 29, 2016 at 17:56
Yes, it is Python 3.5. I tried that, but I got "binary mode doesn't take an errors argument".

– Denly

Commented Jul 29, 2016 at 19:10
Just removing errors='ignore' should do the trick. Or use the answer you posted yourself.

– Philip Tzou

Commented Jul 29, 2016 at 22:08
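
The exchange above can be reproduced with a short, self-contained sketch. It assumes a throwaway temporary file holding one Latin-1 byte (0xe4); the sample content and the variable names are illustrative and not taken from the 20news data set.

```python
import os
import tempfile

# Hypothetical sample: a few bytes including 0xe4 (Latin-1 "ä").
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Gr\xe4fenberg")
    path = tmp.name

rejected = False
message = ""
try:
    # Binary mode performs no decoding, so open() refuses an `errors` argument.
    try:
        open(path, mode="rb", errors="ignore")
    except ValueError as exc:
        rejected = True
        message = str(exc)

    # Without `errors`, binary mode simply returns the raw bytes.
    with open(path, mode="rb") as fh:
        raw = fh.read()
finally:
    os.remove(path)

print(rejected, message)
print(raw)
```

The `errors` parameter only makes sense when decoding happens, which is why it is accepted in text mode and rejected in binary mode.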


2 Answers


I found the solution: pass the encoding explicitly.

open(f, encoding="latin1")

I'm not sure why it only happens on my Mac, though. I'd like to know why.


answered Jul 29, 2016 at 20:07


Denly

959 reputation, 1 gold badge, 11 silver badges, 21 bronze badges

1 Comment


Alastair McCormack Over a year ago

When you open a file in text mode in Python 3 without an encoding argument, your locale determines which encoding is used to decode the file. On Windows, that will be an 8-bit code page such as cp1252, which is close to latin1. On Mac and modern Linux, it's likely to be UTF-8. You should never open a text file without specifying the encoding.

Commented Jan 7, 2020 at 10:30
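
To see which encoding a given machine would pick, a minimal sketch like the following may help. It assumes Python 3.7 or later (for `sys.flags.utf8_mode`); the variable name `default_encoding` is illustrative.

```python
import locale
import sys

# The encoding open() falls back to in text mode when `encoding` is omitted.
default_encoding = locale.getpreferredencoding(False)

print("open() default encoding:", default_encoding)
# UTF-8 mode (python -X utf8 or PYTHONUTF8=1) forces UTF-8 regardless of locale.
print("UTF-8 mode enabled:", bool(sys.flags.utf8_mode))
```

Running this on the Mac from the question and on a Windows machine would likely print different encodings, which would explain why the same script fails on one and not the other.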

In Python 3, open() defaults to text mode ('r'), which decodes the file content using the platform's default encoding (UTF-8 on most platforms). Since your files are encoded in latin1, decoding them as UTF-8 can raise UnicodeDecodeError. The solution is either to open the files in binary mode ('rb') or to specify the correct encoding (encoding="latin1").

open(f, 'rb').read() # returns `bytes` rather than `str`
# or,
open(f, encoding='latin1').read() # returns latin1 decoded `str`
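
As a rough, self-contained illustration of both options, the sketch below writes a small Latin-1 file to a temporary path and reads it three ways. The sample bytes and all variable names are made up for the demonstration; only the byte 0xe4 matches the error in the question.

```python
import os
import tempfile

# Hypothetical sample content: 0xe4 is "ä" in Latin-1 but invalid here as UTF-8.
data = b"Dr. V\xe4is\xe4l\xe4"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
    path = tmp.name

try:
    # 1) Decoding as UTF-8 fails on the Latin-1 byte.
    try:
        with open(path, encoding="utf-8") as fh:
            fh.read()
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

    # 2) Binary mode: no decoding at all, the result is `bytes`.
    with open(path, "rb") as fh:
        as_bytes = fh.read()

    # 3) Text mode with the matching encoding: the result is `str`.
    with open(path, encoding="latin1") as fh:
        as_text = fh.read()
finally:
    os.remove(path)

print(utf8_failed)
print(type(as_bytes).__name__, type(as_text).__name__)
```

In the question's setup the vectorizer was created with encoding='latin1', which scikit-learn uses to decode bytes input, so either variant should work there; the `with` blocks are an addition here so that each file handle is closed promptly.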


answered Jul 29, 2016 at 22:16


Philip Tzou

6,558 reputation, 2 gold badges, 22 silver badges, 31 bronze badges
