I have an unicode encoded (with BOM) source file and some string that contains unicode symbols. I want to replace all characters that not belong to a defined character set with an underscore.
# coding: utf-8
import os
import sys
import re
t = "π [Β°] \n β¬ dsf $ Β¬ 1 Γ 2 t34εΓ";
print re.sub(r'[^A-Za-z0-9 !#%&()*+,-./:;<=>?[\]^_{|}~"\'\\]', '_', t, flags=re.UNICODE)
output:
output: ____ [__] _ ___ dsf _ __ 1 __ 2 t3__4__
expected: _ [_] _ _ dsf _ _ 1 _ 2 t3_4_
But each character is replaced by a number of its underscores that may be equal to the bytes in its unicode representation.
Maybe an additional problem:
In the actual problem the strings is read from a unicode file by another python module and I do not know if it handles the unicodeness correctly. So may be the string variable is marked as ascii but contains unicode sequences.
I have an unicode encoded (with BOM) source file and some string that contains unicode symbols. I want to replace all characters that not belong to a defined character set with an underscore.
# coding: utf-8
import os
import sys
import re
t = "π [Β°] \n β¬ dsf $ Β¬ 1 Γ 2 t34εΓ";
print re.sub(r'[^A-Za-z0-9 !#%&()*+,-./:;<=>?[\]^_{|}~"\'\\]', '_', t, flags=re.UNICODE)
output:
____ [__] _ ___ dsf _ __ 1 __ 2 t3__4__
But each character is replaced by a number of its underscores that may be equal to the bytes in its unicode representation.
I have an unicode encoded (with BOM) source file and some string that contains unicode symbols. I want to replace all characters that not belong to a defined character set with an underscore.
# coding: utf-8
import os
import sys
import re
t = "π [Β°] \n β¬ dsf $ Β¬ 1 Γ 2 t34εΓ";
print re.sub(r'[^A-Za-z0-9 !#%&()*+,-./:;<=>?[\]^_{|}~"\'\\]', '_', t, flags=re.UNICODE)
output: ____ [__] _ ___ dsf _ __ 1 __ 2 t3__4__
expected: _ [_] _ _ dsf _ _ 1 _ 2 t3_4_
But each character is replaced by a number of its underscores that may be equal to the bytes in its unicode representation.
Maybe an additional problem:
In the actual problem the strings is read from a unicode file by another python module and I do not know if it handles the unicodeness correctly. So may be the string variable is marked as ascii but contains unicode sequences.
python unicode string in regexes
I have an unicode encoded (with BOM) source file and some string that contains unicode symbols. I want to replace all characters that not belong to a defined character set with an underscore.
# coding: utf-8
import os
import sys
import re
t = "π [Β°] \n β¬ dsf $ Β¬ 1 Γ 2 t34εΓ";
print re.sub(r'[^A-Za-z0-9 !#%&()*+,-./:;<=>?[\]^_{|}~"\'\\]', '_', t, flags=re.UNICODE)
output:
____ [__] _ ___ dsf _ __ 1 __ 2 t3__4__
But each character is replaced by a number of its underscores that may be equal to the bytes in its unicode representation.