I need to replace all non-ASCII (\x00-\x7F) characters with a space. I'm surprised that this is not dead-easy in Python, unless I'm missing something. The following function simply removes all non-ASCII characters:
def remove_non_ascii_1(text):
return ''.join(i for i in text if ord(i)<128)
And this one replaces non-ASCII characters with the amount of spaces as per the amount of bytes in the character code point (i.e. the – character is replaced with 3 spaces):
def remove_non_ascii_2(text):
return re.sub(r'[^\x00-\x7F]',' ', text)
How can I replace all non-ASCII characters with a single space?
Of the myriad of similar SO questions, none address character replacement as opposed to stripping, and additionally address all non-ascii characters not a specific character.
str.join()needs a list (it'll pass over the values twice), and a generator expression will first be converted to one. Giving it a list comprehension is simply faster. See this post.–character is replaced with 3 spaces" in the question implies that the input is a bytestring (not Unicode) and therefore Python 2 is used (otherwise''.joinwould fail). If OP wants a single space per Unicode codepoint then the input should be decoded into Unicode first.