Replace non-ASCII characters with a single space

I need to replace all non-ASCII characters (anything outside \x00-\x7F) with a space. I'm surprised that this is not dead-easy in Python, unless I'm missing something. The following function simply removes all non-ASCII characters:

def remove_non_ascii_1(text):
    return ''.join(i for i in text if ord(i) < 128)

And this one replaces each non-ASCII character with as many spaces as there are bytes in the character's encoding (i.e. a 3-byte character is replaced with 3 spaces):

import re

def remove_non_ascii_2(text):
    return re.sub(r'[^\x00-\x7F]', ' ', text)
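To illustrate the behaviour described above, here is a minimal Python 3 sketch. It assumes the input was a UTF-8 encoded byte string, which the question does not state outright; the helper names `replace_bytes` and `replace_text` and the EN DASH sample are made up for this illustration.

```python
import re

def replace_bytes(data: bytes) -> bytes:
    # Works on raw bytes, so every non-ASCII *byte* becomes one space.
    return re.sub(rb'[^\x00-\x7F]', b' ', data)

def replace_text(text: str) -> str:
    # Works on decoded text, so every non-ASCII *code point* becomes one space.
    return re.sub(r'[^\x00-\x7F]', ' ', text)

sample = 'a\u2013b'               # 'a', EN DASH, 'b'
encoded = sample.encode('utf-8')  # EN DASH takes 3 bytes in UTF-8

print(replace_bytes(encoded))  # prints: b'a   b'  (3 spaces)
print(replace_text(sample))    # prints: a b       (1 space)
```

The same pattern gives three spaces or one depending only on whether it is applied to bytes or to decoded text.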

How can I replace all non-ASCII characters with a single space?

Of the myriad similar SO questions, none addresses character replacement as opposed to stripping, and none addresses all non-ASCII characters rather than one specific character.

  • 20
    @dstromberg: slower; str.join() needs a list (it'll pass over the values twice), and a generator expression will first be converted to one. Giving it a list comprehension is simply faster. See this post. Commented Nov 19, 2013 at 18:42
  • 1
    The first piece of code will insert multiple blanks per character if you feed it a UTF-8 byte string. Commented Nov 19, 2013 at 19:13
  • @MarkRansom: I was assuming this to be Python 3. Commented Nov 19, 2013 at 19:15
  • 3
    "character is replaced with 3 spaces" in the question implies that the input is a bytestring (not Unicode) and therefore that Python 2 is used (otherwise ''.join would fail). If OP wants a single space per Unicode code point, then the input should be decoded into Unicode first. Commented Feb 19, 2016 at 17:01
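The points raised in the comments above can be combined into one rough sketch, assuming Python 3 and UTF-8 input. The function name `replace_non_ascii` and its `encoding` parameter are hypothetical, chosen only for this example. It decodes byte strings first, so a multi-byte character counts once, and it hands `str.join()` a list comprehension rather than a generator expression.

```python
def replace_non_ascii(text, encoding='utf-8'):
    # Decode byte strings first so one multi-byte character is one code point.
    # The default encoding is an assumption; pass the real one if it differs.
    if isinstance(text, bytes):
        text = text.decode(encoding)
    # List comprehension: str.join() would turn a generator into a list anyway.
    return ''.join([ch if ord(ch) < 128 else ' ' for ch in text])

print(replace_non_ascii('a\u2013b'.encode('utf-8')))  # prints: a b
print(replace_non_ascii('na\u00efve'))                # prints: na ve
```

Each non-ASCII code point maps to exactly one space, so a run of several non-ASCII characters still yields several spaces.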
