Remove special Unicode chars using regular expressions?

Question 1

I'm using Wikipedia's API to get a simple JSON object containing the first paragraph of a wiki page, which I later want to read to the user using text-to-speech. However, some articles include a special transcription of the proper pronunciation. For example, when I follow the link for Chihuahua, the text in the JSON comes out like this: "The Chihuahua /t\u0283\u026a\u02c8w\u0251\u02d0w\u0251\u02d0/ (Spanish: chihuahue\u00f1o) is the smallest breed of dog" My question is: what is the regular expression that would remove the pronunciation part (and maybe remove any Unicode special chars: \u and the 4 chars after it)?

Trying re.sub("\/.+\/", "", test) just adds another \ in front of each existing \.

Comment 1

What language/regex flavor are you using here? Also, when you say "remove the pronunciation part", do you only mean removing /tʃɪˈwɑːwɑː/, or also (Spanish: chihuahueño), since it uses Unicode for the ñ?

Comment 2

Well, removing the pronunciation is step one. And yes, using Python.

Answer

(Assuming for now that you're using Python because you used re.sub, and you only want to remove /tʃɪˈwɑːwɑː/ because of your example regex.)

First, you need to use Python's raw string notation for regular expression patterns because Python uses backslashes for other things (source); put an r in front of the string literal for your regular expression and your original example might be sufficient.
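
As a small illustration of why the r prefix matters (a sketch of my own, not taken from the question): in a normal string literal Python interprets some backslash sequences itself before the regex engine ever sees them, while a raw string passes the backslash through untouched. The pattern and sample text below are made up for the demonstration.

```python
import re

# In a normal string literal, "\b" is a single backspace character.
plain = "\bcat\b"
# In a raw string literal the backslash is kept, so the regex engine
# receives \b and treats it as a word boundary.
raw = r"\bcat\b"

plain_len = len(plain)  # backspace, c, a, t, backspace
raw_len = len(raw)      # backslash, b, c, a, t, backslash, b

# The plain pattern looks for literal backspace characters around "cat".
plain_match = re.search(plain, "a cat sat")
# The raw pattern looks for "cat" as a whole word.
raw_match = re.search(raw, "a cat sat")

print(plain_len, raw_len)
print(plain_match, raw_match.group(0))
```

Only the raw pattern finds the word here, because the plain one is searching for backspace characters that are not in the text.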

Anyway, you're on the right track: Unicode doesn't require any special handling for your example case. You just need to remove everything between the two slashes. I'd also disallow whitespace between the slashes so you don't match everything between two unrelated slashes that are far apart in the document. The following works for me in the Python 2.7.12 REPL:

>>> re.sub(r'\/[^/\s]+\/\s*', '', "The Chihuahua /t\u0283\u026a\u02c8w\u0251\u02d0w\u0251\u02d0/ (Spanish: chihuahue\u00f1o) is the smallest breed of dog")
'The Chihuahua (Spanish: chihuahue\\u00f1o) is the smallest breed of dog'

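(The doubled backslash in \\u00f1 is only how the REPL displays a literal backslash when it echoes the string; the string itself still contains a single one.)

Here's that regular expression broken down: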

\/ # Match the opening slash of the pronunciation expression
[^ # Begin a negated character set
 / # Exclude the forward slash /
 \s # Also exclude all whitespace
]+ # Match one or more characters that are neither a slash nor whitespace
\/ # Match the closing slash of the pronunciation expression
\s* # Match any whitespace that follows, too
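
The breakdown above can also be written directly as a commented pattern with re.VERBOSE. The following is a minimal Python 3 sketch under a few assumptions: the names PRONUNCIATION and strip_pronunciation are hypothetical, and I'm assuming the API text arrives as a JSON string that you decode with json.loads before cleaning it, so the \uXXXX escapes are already real characters by the time the regex runs.

```python
import json
import re

# Same pattern as in the breakdown, with the comments living inside it.
# PRONUNCIATION is an illustrative name, not part of any library.
PRONUNCIATION = re.compile(r"""
    /          # opening slash of the pronunciation
    [^/\s]+    # one or more characters that are neither a slash nor whitespace
    /          # closing slash
    \s*        # any whitespace that follows
""", re.VERBOSE)


def strip_pronunciation(text):
    """Remove /.../ spans that contain no whitespace (hypothetical helper)."""
    return PRONUNCIATION.sub("", text)


# Assumed shape of the payload: a JSON string literal with \uXXXX escapes.
raw_json = (
    r'"The Chihuahua /t\u0283\u026a\u02c8w\u0251\u02d0w\u0251\u02d0/ '
    r'(Spanish: chihuahue\u00f1o) is the smallest breed of dog"'
)

decoded = json.loads(raw_json)  # escapes become real characters, e.g. ñ
cleaned = strip_pronunciation(decoded)
print(cleaned)
```

Once the JSON is decoded there are no literal \u sequences left to strip, so whether to keep characters like ñ becomes a separate decision from removing the pronunciation. Note that slashes separated by whitespace, as in "a / b / c", are left alone by this pattern.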

Brad Buchanan · Accepted Answer · 2016-12-17 18:00:06Z
