I'm running this:
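from sklearn.datasets import load_mlcomp
from sklearn.feature_extraction.text import TfidfVectorizer

news_train = load_mlcomp('20news-18828', 'train')
vectorizer = TfidfVectorizer(encoding='latin1')
X_train = vectorizer.fit_transform((open(f, errors='ignore').read()
                                    for f in news_train.filenames))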
but it raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 39: invalid continuation byte, at the open() call.
I checked news_train.filenames. It is:
array(['/Users/juby/Downloads/mlcomp/379/train/sci.med/12836-58920',
..., '/Users/juby/Downloads/mlcomp/379/train/sci.space/14129-61228'],
dtype='<U74')
The paths look correct. It may be related to the dtype or to my environment (I'm on Mac OS X 10.11), but I haven't been able to fix it after many attempts. Thank you!
P.S. It's an ML tutorial from http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html#example-text-mlcomp-sparse-document-classification-py
2 Answers
Well, I found the solution: using
open(f, encoding="latin1")
I'm not sure why it only happens on my Mac, though. I'd like to know why.
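One way to investigate the platform difference is to ask Python which encoding open() falls back to when no encoding is passed. This is only a minimal sketch using the standard library; the helper name default_open_encoding is made up for illustration and is not part of the tutorial.

```python
import os


def default_open_encoding():
    """Return the encoding open() picks when encoding= is omitted.

    Hypothetical helper: it opens os.devnull in text mode only to
    inspect the resulting file object's .encoding attribute.
    """
    with open(os.devnull) as fh:
        return fh.encoding


if __name__ == "__main__":
    # The printed value depends on the platform and locale settings.
    print(default_open_encoding())
```

If this prints a UTF-8 variant on one machine and an 8-bit code page on another, that would explain why the same script fails on only one of them.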
1 Comment
In Python 3, open uses your locale to determine which encoding to decode the file with. On Windows, that will be an 8-bit code page such as latin1. On Mac and modern Linux, it's likely to be UTF-8. You should never open a file without specifying the encoding.
Actually, in Python 3+, the open function opens and reads a file in the default mode 'r', which decodes the file content (on most platforms, as UTF-8). Since your files are encoded in latin1, decoding them as UTF-8 can raise UnicodeDecodeError. The solution is either to open the files in binary mode ('rb') or to specify the correct encoding (encoding="latin1").
open(f, 'rb').read() # returns `bytes` rather than `str`
# or,
open(f, encoding='latin1').read() # returns latin1 decoded `str`
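The difference between the two options can be reproduced without the dataset. The sketch below is an illustration under assumptions, not the tutorial's code: SAMPLE, read_three_ways and demo are hypothetical names, and the sample bytes are chosen so that 0xe4 (the latin1 byte for "ä") is followed by a plain ASCII letter, which is the situation the error message in the question describes.

```python
import os
import tempfile

# Hypothetical sample: "Haagen" with a-umlaut as its second letter, stored
# as latin1, so that letter is the single byte 0xe4.
SAMPLE = b"H\xe4agen"


def read_three_ways(path):
    """Read one file as raw bytes, as latin1 text, and as UTF-8 text."""
    with open(path, "rb") as fh:
        raw = fh.read()            # bytes, no decoding attempted
    with open(path, encoding="latin1") as fh:
        text = fh.read()           # str, every byte maps to one character
    try:
        with open(path, encoding="utf-8") as fh:
            fh.read()
        utf8_error = None
    except UnicodeDecodeError as exc:
        utf8_error = exc           # 0xe4 is not followed by a continuation byte
    return raw, text, utf8_error


def demo():
    """Write SAMPLE to a temporary file and read it back three ways."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(SAMPLE)
        return read_three_ways(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    raw, text, utf8_error = demo()
    print(type(raw).__name__, type(text).__name__, utf8_error)
```

Since the vectorizer in the question was created with encoding='latin1', the 'rb' variant should let TfidfVectorizer do the decoding itself, as far as I understand its encoding parameter.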
Comments
open(f, mode='rb', errors='ignore').