Dealing with Unicode Characters

Question 1

I know this question has been asked countless times before, but I can't seem to get any of the solutions working. I've tried the codecs module and the io module, and nothing seems to work.

I'm scraping some stuff off the web, then logging the details of each item to a text file, yet the script breaks as soon as it first encounters a Unicode character.

AHIMSA Centro de Sanación Pránica, Pranic Healing

Further, I'm not sure where or when Unicode characters might pop up, which adds an extra level of complexity, so I need an overarching solution for any potential non-ASCII characters.

I'm not sure if I'll have Python 3.6.5 in the production environment, so the solution has to work with 2.7.

What can I do here? How can I deal with this?

# -*- coding: utf-8 -*-
...
with open('test.txt', 'w') as f:
    f.write(str(len(discoverable_cards)) + '\n\n')
    for cnt in range(0, len(discoverable_cards)):
        t = get_time()
        f.write('[ {} ] {}\n'.format(t, discoverable_cards[cnt]))
        f.write('[ {} ] {}\n'.format(t, cnt + 1))
        f.write('[ {} ] {}\n'.format(t, product_type[cnt].text))
        f.write('[ {} ] {}\n'.format(t, titles[cnt].text))
...

Any help would be appreciated!

Question 2

Are you using Python 2 or 3?

Question 3

@MatthewStory Python 2.7. I should've added that.

Question 4

If you open the file in wb rather than w mode, you can write to the file as a bytes string. f.write(bytes('[ {} ] {}\n'.format(t, discoverable_cards[cnt]))). That way, your encoding won't get angry

Question 5

@C.Nivs Funny, I was using wb before, then switched to w since you can't append as you normally would to wb files :/ How do you append to files created with wb?
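On the append question: a minimal sketch, assuming the goal is simply to add more bytes to a file that was first created in wb mode. The helper name append_bytes and the temporary path are made up for illustration. Opening with the standard "ab" mode appends in binary; note that this does not by itself fix the encoding error, since the text still has to be encoded before it is written.

```python
import os
import tempfile


def append_bytes(path, data):
    # "ab" opens the file for appending in binary mode (creates it if missing).
    with open(path, "ab") as f:
        f.write(data)


# Hypothetical log file in a throwaway directory.
log_path = os.path.join(tempfile.mkdtemp(), "test.txt")

# Create the file in "wb" mode, writing explicitly encoded bytes.
with open(log_path, "wb") as f:
    f.write(u"first line\n".encode("utf-8"))

# Append a second line later on, again as encoded bytes.
append_bytes(log_path, u"Sanaci\u00f3n Pr\u00e1nica\n".encode("utf-8"))

with open(log_path, "rb") as f:
    contents = f.read()
```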

Question 6

@C.Nivs I'm still getting the error with wb :/


Matthew Story · Accepted Answer · 2018-06-26 20:24:50Z

1

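Since you are on Python 2.7, you will probably want to explicitly encode all of your strings with a Unicode-compatible encoding such as "utf8" before passing them to write. You can do this with a small helper function:

# Python 3 compatibility: Python 3 has no `unicode` builtin, so alias it to str.
# Keep this at module level. Assigning `unicode` inside the function would make
# it a local name there, and the lookup would then fail on Python 2 as well.
try:
    unicode
except NameError:
    unicode = str


def safe_encode(str_or_unicode):
    # Encode text (unicode) objects to UTF-8; pass byte strings through as-is.
    if isinstance(str_or_unicode, unicode):
        return str_or_unicode.encode("utf8")
    return str_or_unicode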

You would then use it like this:

f.write('[ {} ] {}\n'.format(safe_encode(t), safe_encode(discoverable_cards[cnt])))
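As an alternative to encoding each value by hand, here is a rough sketch that lets the file object do the encoding. It assumes the scraped values are already text (unicode) objects rather than encoded byte strings; write_log and its parameters are hypothetical names, not part of the original script. io.open exists on both Python 2.7 and Python 3 and accepts an encoding argument.

```python
import io
import os
import tempfile


def write_log(path, timestamp, items):
    # io.open in text mode encodes on write, so only text objects are passed in.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(u"{}\n\n".format(len(items)))
        for index, item in enumerate(items, 1):
            f.write(u"[ {} ] {}\n".format(timestamp, index))
            f.write(u"[ {} ] {}\n".format(timestamp, item))


# Hypothetical usage with one non-ASCII item.
log_path = os.path.join(tempfile.mkdtemp(), "test.txt")
write_log(log_path, u"12:00", [u"Sanaci\u00f3n Pr\u00e1nica"])

with io.open(log_path, encoding="utf-8") as f:
    content = f.read()
```

The u"" prefixes matter on Python 2.7: a file opened through io.open in text mode only accepts unicode objects there, so the format strings have to be unicode as well.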

Answered Jun 26, 2018 at 20:24 by Matthew Story; edited Jun 26, 2018 at 21:23 by Leo K.

6 Comments

oldboy · 2018-06-26T20:44:12.207Z

Why is there no dash in return str_or_unicode.encode('utf8')? Why isn't utf8 written utf-8? Maybe that's what I was doing wrong. You see both versions used everywhere, and I had just assumed that utf8 was a typo.

Matthew Story · 2018-06-26T20:47:49.793Z

It should work either way. The most common reason that an encode fails is that you call it on an already encoded string. This is one of the worst design flaws in Python 2.7. When you call "encode" on a byte string, it first decodes it to a unicode object using the ASCII charset, and then calls encode with the encoding you passed. That's why safe_encode checks that it has a unicode object before encoding it. Python 3 solves this by only defining encode on text (str) objects and decode on bytes objects, so you get an AttributeError if you try to encode a bytes object rather than a weird Unicode bug.
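To illustrate the "either way" part, a small sketch using the standard codecs module: "utf8" and "utf-8" are aliases that resolve to the same codec. The variable names here are illustrative only.

```python
import codecs

# Both spellings resolve to the same registered codec.
utf8_info = codecs.lookup("utf8")
utf_dash_8_info = codecs.lookup("utf-8")
same_codec = utf8_info.name == utf_dash_8_info.name

# Encoding with either spelling produces identical bytes.
text = u"Pr\u00e1nica"
same_bytes = text.encode("utf8") == text.encode("utf-8") == text.encode("UTF-8")
```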

oldboy · 2018-06-26T20:53:03.41Z

I think I know what you're talking about. When I was trying to encode the strings before, it would throw an error if the value was already a string. Is this what you're talking about?

Matthew Story · 2018-06-26T20:55:09.157Z

Yup. Encoding an already "utf8" encoded string with "utf8" will throw an error (a UnicodeDecodeError) in Python 2 whenever the string contains non-ASCII bytes ... pretty rad.
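The double-encode failure can be reproduced with a short sketch. Since Python 3 bytes objects have no encode method, the hypothetical helper encode_again spells out the two steps that Python 2 performs implicitly when encode is called on a byte string: an ASCII decode followed by the requested encode.

```python
text = u"Sanaci\u00f3n Pr\u00e1nica"
encoded = text.encode("utf-8")  # already-encoded byte string


def encode_again(byte_string):
    # Mimics Python 2's str.encode: implicit ASCII decode, then the real encode.
    return byte_string.decode("ascii").encode("utf-8")


try:
    encode_again(encoded)
    error_name = None
except UnicodeDecodeError as exc:
    # The implicit ASCII decode chokes on the non-ASCII bytes.
    error_name = type(exc).__name__

# Pure-ASCII byte strings survive the round trip, which is why the bug
# only shows up once real non-ASCII data arrives.
ascii_only = encode_again(b"Pranic Healing")
```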

Leo K · 2018-06-26T21:04:27.027Z

Note: the safe_encode definition will break if this is run on Python 3 (unicode isn't defined in py3 anymore). You can fix it like this:

try:
    unicode
except NameError:
    unicode = str

BTW, in general it is difficult to keep code compatible with py2 and py3 and also have it work correctly with Unicode.
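One way to keep the 2/3 differences manageable, sketched here under assumptions: convert everything to text as soon as it enters the program, format in text, and encode exactly once right before writing to a file opened in binary mode. The names text_type, to_text, and format_line are hypothetical helpers, and the sketch assumes any incoming byte strings are UTF-8.

```python
import sys

# Pick the text type for the running interpreter.
if sys.version_info[0] >= 3:
    text_type = str
else:
    text_type = unicode  # noqa: F821 (only evaluated on Python 2)


def to_text(value, encoding="utf-8"):
    # Byte strings are decoded (assumed UTF-8), text passes through,
    # and anything else (ints, timestamps, ...) is converted to text.
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, text_type):
        return value
    return text_type(value)


def format_line(timestamp, value):
    # Build the line as text, then encode once at the output boundary.
    line = u"[ {} ] {}\n".format(to_text(timestamp), to_text(value))
    return line.encode("utf-8")


line_bytes = format_line(u"12:00", u"Sanaci\u00f3n Pr\u00e1nica")
```

The returned bytes can be written to a file opened with "wb" or "ab" on either Python version.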
