I know this question has been asked countless times before, but I can't seem to get any of the solutions working. I've tried the codecs module and the io module, but nothing seems to work.

I'm scraping some stuff off the web, then logging the details of each item to a text file, yet the script breaks as soon as it first encounters a Unicode character.

AHIMSA Centro de Sanación Pránica, Pranic Healing

Further, I'm not sure where or when Unicode characters might pop up, which adds an extra level of complexity. I need an overarching solution, and I'm not exactly sure how to deal with potential non-ASCII characters.

I'm not sure if I'll have Python 3.6.5 in the production environment, so the solution has to work with 2.7.

What can I do here? How can I deal with this?

# -*- coding: utf-8 -*-
...
with open('test.txt', 'w') as f:
    f.write(str(len(discoverable_cards)) + '\n\n')
    for cnt in range(0, len(discoverable_cards)):
        t = get_time()
        f.write('[ {} ] {}\n'.format(t, discoverable_cards[cnt]))
        f.write('[ {} ] {}\n'.format(t, cnt + 1))
        f.write('[ {} ] {}\n'.format(t, product_type[cnt].text))
        f.write('[ {} ] {}\n'.format(t, titles[cnt].text))
...

Any help would be appreciated!

Ashley Mills
asked Jun 26, 2018 at 20:06
  • are you using python 2 or 3? Commented Jun 26, 2018 at 20:15
  • @MatthewStory python 2.7 i should've added that Commented Jun 26, 2018 at 20:16
  • If you open the file in wb rather than w mode, you can write to the file as a bytes string. f.write(bytes('[ {} ] {}\n'.format(t, discoverable_cards[cnt]))). That way, your encoding won't get angry Commented Jun 26, 2018 at 20:18
  • @C.Nivs funny, i was using wb before, then switched to w since you can't append as you normally would to wb files :/ how do u append to files created with wb? Commented Jun 26, 2018 at 20:19
  • @C.Nivs i'm still getting the error with wb :/ Commented Jun 26, 2018 at 20:20

1 Answer

Given that you are on Python 2.7, you will probably want to explicitly encode all of your strings with a Unicode-compatible encoding such as "utf8" before passing them to write. You can do this with a small helper function:

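# Py2/py3 compatibility: pick the text type once, at module level.
# (Assigning `unicode` inside the function would make it a local name
# there and hide the Python 2 builtin.)
try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str  # Python 3


def safe_encode(str_or_unicode):
    # Encode only text objects; byte strings pass through unchanged.
    if isinstance(str_or_unicode, text_type):
        return str_or_unicode.encode("utf8")
    return str_or_unicode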

You would then use it like this:

f.write('[ {} ] {}\n'.format(safe_encode(t), safe_encode(discoverable_cards[cnt])))
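
A possible alternative, sketched here rather than taken from the code above: `io.open` exists in both Python 2.7 and Python 3 and accepts an `encoding` argument, so the file object does the encoding and the code only ever writes text. The helper `to_text`, the sample `titles` list, and the temporary file path are assumptions made for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import io
import os
import tempfile

try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str  # Python 3


def to_text(value, encoding="utf-8"):
    """Return value as text: decode byte strings, stringify anything else."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes):
        return value.decode(encoding)
    return text_type(value)


# Hypothetical stand-in for the scraped values.
titles = ["AHIMSA Centro de Sanación Pránica, Pranic Healing", b"plain bytes", 42]

# Hypothetical output location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "test.txt")

# io.open encodes on write, so only text objects are passed to f.write().
with io.open(path, "w", encoding="utf-8") as f:
    for cnt, title in enumerate(titles):
        f.write("[ {} ] {}\n".format(cnt + 1, to_text(title)))

# Read the file back to check what was written.
with io.open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```

With this approach the conversion happens once, at the boundary where data enters the program, instead of at every call to write.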
Leo K
answered Jun 26, 2018 at 20:24
6 Comments

why is there no dash in return str_or_unicode.encode('utf8')? why isn't utf8 utf-8? maybe that's what i was doing wrong. you see both versions being used everywhere and i had just assumed that utf8 was a typo
It should work either way. The most common reason that an encode fails is when you call it on an already encoded string. This is one of the worst design flaws in Python 2.7. When you call "encode" on a byte string, it first decodes it to a unicode object using the ASCII charset, and then calls encode with your passed encoding. That's the reason why safe_encode checks to make sure it's a unicode object before encoding it. Python 3 solves this by defining encode only on text (str) objects and decode only on bytes objects, so you get an AttributeError if you try to encode a bytes object rather than a weird unicode bug.
i think i know what you're talking about. when i was trying to encode the strings before, it would throw an error if the value was already a string. this is what you're talking about, right?
Yup. Encoding an already "utf8" encoded string with "utf8" will throw a UnicodeDecodeError in Python 2 if it contains any non-ASCII bytes ... pretty rad.
NOTE: the 'safe_encode' definition will break if it is run on Python 3 (unicode isn't defined in py3 anymore). You can fix it like this, at module level: try: unicode except NameError: unicode = str. BTW, in general it is difficult to keep code compatible with py2 and py3 and also have it work correctly with Unicode.
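
The implicit decode described in these comments can be sketched as follows. In Python 2, calling encode on a byte string behaves roughly like decoding it as ASCII first and then encoding the result. The function `py2_style_encode` is a hypothetical emulation, written so that the failure can be reproduced on either Python version; it is not a real library call.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals


def py2_style_encode(byte_string, encoding="utf8"):
    """Emulate Python 2's str.encode: implicit ASCII decode, then encode."""
    return byte_string.decode("ascii").encode(encoding)


text = "Sanación Pránica"
encoded = text.encode("utf8")  # text -> bytes: this direction is fine

# Pure-ASCII bytes survive the implicit decode, which hides the problem.
ascii_result = py2_style_encode(b"Pranic Healing")

# Non-ASCII bytes fail in the hidden decode step, not in the encode itself.
try:
    py2_style_encode(encoded)
    error_name = None
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__
```

This is why the error only shows up once the first non-ASCII character arrives: ASCII-only data passes through the hidden decode without complaint.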