7

I have an HTML file to read and parse. It is encoded as Unicode (that is what Notepad shows), but when I tried

infile = open("path", "r") 
infile.read()

it fails with the famous error:

UnicodeEncodeError: 'charmap' codec can't encode characters in position xx: character maps to <undefined>

So as a test, I copied and pasted the contents of the file into a new one, saved it as UTF-8, and then tried to open it with codecs like this:

inFile = codecs.open("path", "r", encoding="utf-8")
outputStream = inFile.read()

But I get this error message:

UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>

I really don't understand, because I created this file as UTF-8.

Remi Guan
asked Sep 21, 2015 at 12:15
  • That's a Unicode BOM; it seems to be UTF-16. Can you try passing encoding='utf-16'? Commented Sep 21, 2015 at 12:17
  • @EdChum I tried, and the response is: UnicodeError: UTF-16 stream does not start with BOM Commented Sep 21, 2015 at 12:23
  • Can you post the raw input data just the first few lines or a link to the file, thanks. Another option is to skip the first couple of characters but really it should be able to open this without issue Commented Sep 21, 2015 at 12:45
  • It's an htm file from Outlook which starts like this: "<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="schemas.microsoft.com/office/2004/12/omml" xmlns="w3.org/TR/REC-html40">" Commented Sep 21, 2015 at 12:58
  • Are you sure you're getting that error during .read()!? The error during read would be "can't decode". It sounds like you're getting an error when writing to a file or printing to the terminal Commented Sep 23, 2015 at 7:46

3 Answers

6

UnicodeEncodeError means the code fails while encoding Unicode text to bytes, not while decoding the file; that is, your actual code most likely tries to print the text to the Windows console. See Python, Unicode, and the Windows console.
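
To see why the exception name points at output rather than input, here is a minimal sketch that reproduces the same kind of failure without reading any file. It assumes cp1252 stands in for the console's legacy code page; the real console code page may differ (cp437 or cp850, for example).

```python
# Sketch: the error comes from encoding text to bytes, not from reading a file.
# Assumption: cp1252 stands in for the console's legacy code page.
text = "\ufeff<html>"  # already-decoded text that still carries a BOM character

try:
    text.encode("cp1252")  # roughly what print() has to do for such a console
    error_name = None
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__
    print(exc)
```

If the failure really happened inside read(), the exception would be a UnicodeDecodeError instead.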


The link above addresses the UnicodeEncodeError. The next issue is to find out which character encoding the text in your "path" file uses. If notepad.exe shows the text correctly, then the file is either encoded using locale.getpreferredencoding(False) (something like cp1252 on Windows) or it has a BOM.
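
One way to check the "file has BOM" case is to look at the first few bytes. The sketch below is only an illustration: sniff_bom and sniff_file are hypothetical helper names, not part of any library, and a file without a BOM still needs another way to determine its encoding.

```python
import codecs

# UTF-32 is checked first because the UTF-32 LE BOM begins with the
# same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data):
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def sniff_file(path):
    """Read the first four bytes of a file and sniff them for a BOM."""
    with open(path, "rb") as f:
        return sniff_bom(f.read(4))
```

The returned names ("utf-8-sig", "utf-16", "utf-32") are codecs that consume the BOM while decoding, so the result can be passed straight to open(path, encoding=...).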

If you are sure that the encoding is utf-8 then pass it to open() directly. Don't use codecs.open():

with open('path', encoding='utf-8') as file:
    html = file.read()

Sometimes, the input may contain text encoded using multiple (inconsistent) encodings e.g., smart quotes may be encoded using cp1252 while the rest of html is utf-8 -- you could fix it using bs4.UnicodeDammit. See also A good way to get the charset/encoding of an HTTP response in Python

answered Sep 23, 2015 at 19:38

5 Comments

If Notepad says "Unicode" (as the OP said) it means UTF-16. The other encodings are usually called "ANSI" (cp1252 and friends) and "UTF-8" (which is UTF-8 with BOM).
@roeland: yes. "it's encode on unicode (I saw it with the notepad)" from the question can be interpreted that way. The issue with that theory is that codecs.open("path", encoding='utf-8').read() returns u'\ufeff' i.e., utf-8-sig is more likely. 'utf-8' encoding fails for both BOM_UTF16_BE and BOM_UTF16_LE.
Yeah, the question is a bit confusing as it involves two files, the original file in "Unicode", and the file he re-saved as "UTF-8".
@roeland: anyway the issue is UnicodeEncodeError i.e., when OP tries to print Unicode text to Windows console.
Aha, I see. That was subtle
1

The original file probably uses utf-16 (Windows uses the term UNICODE for that encoding).

UTF-8 encoded files on Windows normally start with a magic number b"\xef\xbb\xbf" (the UTF-8 encoding of U+FEFF), so applications reading that file know it was saved as UTF-8 and not some ANSI code page. Use the utf-8-sig codec, which will automatically discard that character.
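
A short sketch of the difference between the two codecs, using an in-memory byte string instead of a real file (the sample content is made up for illustration):

```python
import codecs

# Bytes as Notepad would save them with "UTF-8": BOM first, then the text.
raw = codecs.BOM_UTF8 + "<html>".encode("utf-8")

with_bom = raw.decode("utf-8")         # keeps U+FEFF as the first character
without_bom = raw.decode("utf-8-sig")  # drops the BOM

# utf-8-sig also accepts input that has no BOM at all.
plain = "<html>".encode("utf-8").decode("utf-8-sig")
```

Because utf-8-sig tolerates a missing BOM, it is a reasonable default for reading UTF-8 files that may or may not come from Windows tools.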

answered Sep 21, 2015 at 23:33

1 Comment

As a side-note: Don't use codecs.open. On Py3, you can pass an encoding argument to regular open, and on Py2.7, you can import io.open (which is the same as Py3's built-in open) and do the same. codecs.open has some dumb quirks (e.g. doesn't do universal new line handling).
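
A small sketch of what the comment above describes, assuming Python 3. The temporary file and its contents are invented for the demonstration.

```python
import io
import os
import tempfile

# Write a small file with Windows line endings (hypothetical sample content).
fd, path = tempfile.mkstemp(suffix=".htm")
with os.fdopen(fd, "wb") as f:
    f.write("<p>é</p>\r\n<p>ok</p>\r\n".encode("utf-8"))

try:
    # Built-in open decodes the bytes and, by default, translates \r\n to \n.
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # On Python 3, io.open is the very same function as the built-in open.
    same_function = io.open is open
finally:
    os.remove(path)
```

On Python 2.7 the equivalent would be to import io and call io.open(path, encoding="utf-8").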
1

Anticipating that the OP will update the question to reflect the actual problem: the issue is caused by the encoding of the terminal not being defined.
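
As a stopgap that does not fix the console itself, unencodable characters can be escaped before printing. This is a different technique from the one this answer recommends, shown only as a sketch; printable is a hypothetical helper name, and falling back to "ascii" when the stream reports no encoding is an assumption.

```python
import sys


def printable(text, encoding=None):
    """Return text with characters the target encoding lacks escaped."""
    # Assumption: fall back to ASCII when the stream reports no encoding.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # backslashreplace turns unencodable characters into \uXXXX escapes
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


print(printable("\ufeff<html>"))
```

The output is lossy (escaped characters are no longer the original text), so this suits debugging output rather than data that will be processed further.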

The Windows console is notoriously poor when it comes to Unicode support. For ultimate support, see https://pypi.python.org/pypi/win_unicode_console. Essentially, install "win_unicode_console" (pip install win_unicode_console). Then at the top of your code:

import win_unicode_console
win_unicode_console.enable()

You may also need to use a suitable font - See https://stackoverflow.com/a/5750227/1554386

As you're using an input with a UTF-8 BOM, you should use the utf_8_sig codec so that the BOM is stripped before working with the contents.

As this is Python 3, you don't need to use the codecs module to set encoding when using open().

Putting it together it would look like:

import win_unicode_console
win_unicode_console.enable()
# utf_8_sig strips the BOM while decoding
with open("path", "r", encoding="utf_8_sig") as infile:
    html = infile.read()
answered Sep 23, 2015 at 8:59

2 Comments

it is best to avoid modifying the script. You could run it using run module instead (a part of win-unicode-console): py -m run your-unicode-printing-script.py or if it is appropriate in your case then put win_unicode_console.enable() call into sitecustomize or usercustomize modules.
Is this package still maintained? What does it do specifically?
