python unicode string in regexes
I have an unicode encoded (with BOM) source file and some string that contains unicode symbols. I want to replace all characters that not belong to a defined character set with an underscore.
# coding: utf-8
import os
import sys
import re
t = "π [Β°] \n β¬ dsf $ Β¬ 1 Γ 2 t34εΓ";
print re.sub(r'[^A-Za-z0-9 !#%&()*+,-./:;<=>?[\]^_{|}~"\'\\]', '_', t, flags=re.UNICODE)
output:
____ [__] _ ___ dsf _ __ 1 __ 2 t3__4__
But each character is replaced by a number of its underscores that may be equal to the bytes in its unicode representation.
vlad_tepesch
- 7k
- 2
- 46
- 89
lang-py