1. Home
2. Questions
3. AI Assist
4. Tags
5. Challenges
6. Chat
7. Articles
8. Users
9. Companies
11. Communities for your favorite technologies. Explore all Collectives
Stack Internal

Stack Overflow for Teams is now called Stack Internal. Bring the best of human thought and AI automation together at your work.
Try for free Learn more
Bring the best of human thought and AI automation together at your work. Learn more

Return to Revisions

1 of 2

asked May 7, 2018 at 15:42

python unicode string in regexes

I have an unicode encoded (with BOM) source file and some string that contains unicode symbols. I want to replace all characters that not belong to a defined character set with an underscore.

# coding: utf-8 
import os
import sys
import re
t = "🙂 [°] \n € dsf $ ¬ 1 Ä 2 t34円Ú";
print re.sub(r'[^A-Za-z0-9 !#%&()*+,-./:;<=>?[\]^_{|}~"\'\\]', '_', t, flags=re.UNICODE)

output:

____ [__] _ ___ dsf _ __ 1 __ 2 t3__4__

But each character is replaced by a number of its underscores that may be equal to the bytes in its unicode representation.

asked May 7, 2018 at 15:42

vlad_tepesch

lang-py

CollectivesTM on Stack Overflow