I have a Unicode-encoded source file (with BOM) and a string that contains Unicode symbols. I want to replace every character that does not belong to a defined character set with an underscore.
# coding: utf-8
import os
import sys
import re
t = "🙂 [°] \n € dsf $ ¬ 1 Ä 2 t34円Ú";
print re.sub(r'[^A-Za-z0-9 !#%&()*+,-./:;<=>?[\]^_{|}~"\'\\]', '_', t, flags=re.UNICODE)
output: ____ [__] _ ___ dsf _ __ 1 __ 2 t3__4__
expected: _ [_] _ _ dsf _ _ 1 _ 2 t3_4_
But each character is replaced by several underscores, apparently one for each byte of its encoded representation.
Possibly an additional problem:
In the actual problem the string is read from a Unicode file by another Python module, and I do not know whether that module handles Unicode correctly. So the string variable may be typed as a plain ASCII (byte) string but contain encoded Unicode sequences.
2 Answers
Operate on Unicode strings, not byte strings. Your source is encoded as UTF-8, so each character is encoded as one to four bytes, and a regex applied to the byte string replaces every byte separately. Decoding to Unicode strings or using Unicode literals will help. The code also appears to be Python 2, so on narrow Python 2 builds (the default on Windows) you'll still have an issue with characters above U+FFFF. You could also have issues with graphemes built from two or more Unicode code points:
# coding: utf-8
import re
t = u"🙂 [°] \n € dsf $ ¬ 1 Ä 2 t34円Ú";
print re.sub(ur'[^A-Za-z0-9 !#%&()*+,-./:;<=>?[\]^_{|}~"\'\\]', '_', t, flags=re.UNICODE)
Output (on Windows Python 2.7 narrow build):
__ [_] _ _ dsf _ _ 1 _ 2 t3_4_
Note the first emoji still has a double underscore. On a narrow build, Unicode characters greater than U+FFFF are stored as surrogate pairs, so the regex sees two code units. This can be handled by matching the pairs explicitly. The first code unit of a surrogate pair is in the range U+D800 to U+DBFF and the second is in the range U+DC00 to U+DFFF:
# coding: utf-8
import re
t = u"🙂 [°] \n € dsf $ ¬ 1 Ä 2 t34円Ú";
print re.sub(ur'[\ud800-\udbff][\udc00-\udfff]|[^A-Za-z0-9 !#%&()*+,-./:;<=>?[\]^_{|}~"\'\\]', '_', t, flags=re.UNICODE)
Output:
_ [_] _ _ dsf _ _ 1 _ 2 t3_4_
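On Python 3 the surrogate workaround should not be needed, because str is a sequence of code points on every platform. A minimal sketch, assuming Python 3 and assuming that any byte input is UTF-8 (possibly with a BOM); the names DISALLOWED_RE and sanitize are made up for illustration:

```python
import re

# Same allowed set as in the question; '-' and '[' are escaped for clarity.
DISALLOWED_RE = re.compile(r'[^A-Za-z0-9 !#%&()*+,\-./:;<=>?\[\]^_{|}~"\'\\]')

def sanitize(text):
    """Replace every code point outside the allowed set with '_'."""
    if isinstance(text, bytes):
        # Assumption: the bytes are UTF-8; 'utf-8-sig' also drops a leading BOM.
        text = text.decode('utf-8-sig')
    return DISALLOWED_RE.sub('_', text)

print(sanitize("🙂 [°] € dsf"))                  # _ [_] _ dsf
print(sanitize("🙂 [°] € dsf".encode('utf-8')))  # same result from bytes
```

Decoding inside the function is one way to cope with a string that arrives as bytes from another module; if the real encoding is not UTF-8, the decode call would need the correct codec instead.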
But you'll still have a problem with complex emoji:
# coding: utf-8
import re
t = u"👨🏻👩🏻👧🏻👦🏻";
print re.sub(ur'[\ud800-\udbff][\udc00-\udfff]|[^A-Za-z0-9 !#%&()*+,-./:;<=>?[\]^_{|}~"\'\\]', '_', t, flags=re.UNICODE)
Output:
___________
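One rough way to get a single underscore per visible emoji is to let each match absorb what follows a disallowed character: skin-tone modifiers (U+1F3FB to U+1F3FF), the variation selector U+FE0F, and code points joined by a zero-width joiner (U+200D). This is only a sketch, assuming Python 3; it approximates grapheme clusters for emoji sequences like the one above and is not full Unicode text segmentation. The names ALLOWED, CLUSTER_RE and sanitize_clusters are illustrative:

```python
import re

# Allowed set from the question, without the surrounding brackets.
ALLOWED = r'A-Za-z0-9 !#%&()*+,\-./:;<=>?\[\]^_{|}~"\'\\'

# One disallowed code point, followed by any number of:
#   - skin-tone modifiers or the variation selector U+FE0F, or
#   - a zero-width joiner plus another disallowed code point.
CLUSTER_RE = re.compile(
    r'[^{a}](?:[\U0001F3FB-\U0001F3FF\uFE0F]|\u200D[^{a}])*'.format(a=ALLOWED)
)

def sanitize_clusters(text):
    """Replace each run that looks like one emoji cluster with a single '_'."""
    return CLUSTER_RE.sub('_', text)

# Four people with skin-tone modifiers, joined by zero-width joiners.
family = ("\U0001F468\U0001F3FB\u200D\U0001F469\U0001F3FB\u200D"
          "\U0001F467\U0001F3FB\u200D\U0001F466\U0001F3FB")
print(sanitize_clusters(family))      # _
print(sanitize_clusters("🙂 [°] €"))  # _ [_] _
```

Combining accents and other cluster types are not covered here. The third-party regex module offers \X for matching grapheme clusters, which may be a more complete option.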
4 Comments
.decode() the string with the correct encoding.
How about:
print(re.sub(r'[^A-Öa-ö0-9 !#%&()*+,-./:;<=>?[\]^_{|}~"\'\\]', '_', t))
/[\u007F-\uFFFF]/ works fine in JavaScript.
..[$@~]` the whole thing can be replaced with [^\x20-\x7e], but this will also match control characters.
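To see how close [^\x20-\x7e] is to the class in the question, one can list the printable ASCII characters that the original class does not allow. A small check, assuming Python 3 (ALLOWED_RE is an illustrative name):

```python
import re

# Positive form of the character class from the question.
ALLOWED_RE = re.compile(r'[A-Za-z0-9 !#%&()*+,\-./:;<=>?\[\]^_{|}~"\'\\]')

# Printable ASCII is U+0020 through U+007E.
missing = [chr(c) for c in range(0x20, 0x7F) if not ALLOWED_RE.fullmatch(chr(c))]
print(missing)  # ['$', '@', '`']

# Control characters such as newline fall outside the range and get replaced.
print(re.sub(r'[^\x20-\x7e]', '_', "a\nb€"))  # a_b_
```

So the range-based class behaves like the original one except that it keeps $, @ and the backtick; the original class also replaces control characters, since it does not list them.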