extension
Source Link
vlad_tepesch
  • 7k
  • 2
  • 46
  • 89

I have a Unicode-encoded source file (with BOM) and a string that contains Unicode symbols. I want to replace every character that does not belong to a defined character set with an underscore.

# coding: utf-8 
import os
import sys
import re
t = "🙂 [°] \n € dsf $ ¬ 1 Ä 2 t34円Ú"
print re.sub(r'[^A-Za-z0-9 !#%&()*+,-./:;<=>?[\]^_{|}~"\'\\]', '_', t, flags=re.UNICODE)

output:   ____ [__] _ ___ dsf _ __ 1 __ 2 t3__4__
expected: _ [_] _ _ dsf _ _ 1 _ 2 t3_4_

However, each character is replaced by several underscores instead of one, possibly one for each byte of its encoded representation.
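The byte-count suspicion can be checked with a small sketch. It assumes the string is held as UTF-8 encoded bytes (which is what a plain "..." literal is in Python 2) and compares the result with the same text held as a unicode string. The sample text and the variable names are illustrative only, not taken from the real program.

```python
import re

# Same character class as in the question.
ALLOWED = r'[^A-Za-z0-9 !#%&()*+,-./:;<=>?[\]^_{|}~"\'\\]'

text = u"[\u00b0] \u20ac \u00c4"   # "[°] € Ä" as a unicode (text) string
raw = text.encode("utf-8")         # the same text as UTF-8 bytes

# On bytes, the regex sees every byte separately, so a multi-byte
# character produces one underscore per byte.
on_bytes = re.sub(ALLOWED.encode("ascii"), b"_", raw)

# On a unicode string, the regex sees whole characters.
on_text = re.sub(ALLOWED, u"_", text)

print(on_bytes.decode("ascii"))    # [__] ___ __
print(on_text)                     # [_] _ _
```

If this matches what happens in the real program, the number of underscores per character should equal the length of that character's UTF-8 encoding.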

A possible additional problem:

In the actual program, the string is read from a Unicode file by another Python module, and I do not know whether that module handles Unicode correctly. The string variable may therefore be typed as a plain (ASCII) byte string while actually containing encoded Unicode sequences.
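For the case where another module hands back the string, one possible approach is to decode byte strings explicitly before applying the regex. The helper names `to_text` and `sanitize` below are hypothetical, and the sketch assumes the data is UTF-8, possibly with a leading BOM; if the real encoding differs, the codec name would have to change.

```python
import re

# Same character class as in the question: anything outside it is replaced.
DISALLOWED = re.compile(r'[^A-Za-z0-9 !#%&()*+,-./:;<=>?[\]^_{|}~"\'\\]')


def to_text(value, encoding="utf-8-sig"):
    """Return value as a text (unicode) string.

    Byte strings are decoded; "utf-8-sig" also drops a leading BOM.
    The default encoding is an assumption, not something the question states.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def sanitize(value):
    """Replace every disallowed character with a single underscore."""
    return DISALLOWED.sub(u"_", to_text(value))


# Hypothetical input: BOM + "1 Ä 2 €" as it might arrive from a file read.
data = b"\xef\xbb\xbf1 \xc3\x84 2 \xe2\x82\xac"
result = sanitize(data)
print(result)  # 1 _ 2 _
```

Decoding once at the boundary means the rest of the code only ever deals with text strings, regardless of how the other module returned the data.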

python unicode string in regexes
