I'm using Wikipedia's API to get a simple JSON object containing the first paragraph of a wiki page, which I later want to read to the user using text-to-speech. However, some articles include a special transcription of the proper pronunciation. For example, when I follow the link for Chihuahua, the text in the JSON comes out like this: "The Chihuahua /t\u0283\u026a\u02c8w\u0251\u02d0w\u0251\u02d0/ (Spanish: chihuahue\u00f1o) is the smallest breed of dog" My question is: what is the regular expression that would remove the pronunciation part (and maybe remove any Unicode special characters: \u and the 4 characters after it)?
Trying re.sub("\/.+\/", "", test) just adds another \ before every existing \.
1 Answer
(Assuming for now that you're using Python because you used re.sub, and you only want to remove /tʃɪˈwɑːwɑː/ because of your example regex.)
First, use Python's raw string notation for regular expression patterns, because Python itself uses backslashes for escape sequences in ordinary string literals (source). Put an r in front of the string literal for your regular expression, and your original example might be sufficient.
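To illustrate what the r prefix changes, here is a minimal sketch (the variable names are only for illustration). In an ordinary literal, Python interprets an escape such as \n itself; in a raw literal, the backslash is kept and passed through to the re module untouched:

```python
import re

# In an ordinary literal, "\n" is a single newline character.
normal = "\n"
# In a raw literal, the backslash and the "n" stay as two separate characters.
raw = r"\n"

print(len(normal), len(raw))

# The raw pattern r"\s+" reaches the re module unchanged,
# where \s is interpreted as "any whitespace character".
collapsed = re.sub(r"\s+", " ", "a  \t b")
print(collapsed)
```

The forward slash has no special meaning in Python's regex syntax, so it does not have to be escaped; writing it as \/ is harmless inside a raw string.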
Anyway, you're on the right track - Unicode doesn't require any special handling for your example case here. You just need to remove everything between the two slashes. I'd also exclude whitespace from the match between the slashes, so you don't capture everything between two unrelated slashes that are far apart in the document. The following works for me in the Python 2.7.12 REPL:
>>> re.sub(r'\/[^/\s]+\/\s*', '', "The Chihuahua /t\u0283\u026a\u02c8w\u0251\u02d0w\u0251\u02d0/ (Spanish: chihuahue\u00f1o) is the smallest breed of dog")
'The Chihuahua (Spanish: chihuahue\\u00f1o) is the smallest breed of dog'
Here's that regular expression broken down:
\/ # Match opening slash on the pronunciation expression
[^ # Begin a negated character set
/ # Exclude the forward-slash /
\s # Also exclude all whitespace
]+ # Match one or more characters that are not a slash or whitespace
\/ # Match closing slash on the pronunciation expression
\s* # Match any whitespace that follows, too, so it is removed as well
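The same breakdown can be kept inside the pattern itself with re.VERBOSE. The following is only a sketch for Python 3, not the code from the session above: the helper name strip_pronunciation and the sample JSON fragment are made up for illustration, and decoding the text with json.loads first is an assumption about how the API response is consumed.

```python
import json
import re

# The pattern from the breakdown, with the comments embedded via re.VERBOSE.
# The slash is left unescaped because it is not special in Python regexes.
PRONUNCIATION = re.compile(r"""
    /          # opening slash of the pronunciation
    [^/\s]+    # one or more characters that are neither a slash nor whitespace
    /          # closing slash of the pronunciation
    \s*        # any whitespace that follows
""", re.VERBOSE)


def strip_pronunciation(text):
    """Remove /.../ spans that contain no whitespace (hypothetical helper)."""
    return PRONUNCIATION.sub("", text)


# Hypothetical JSON string value, as it might appear in the raw response body.
raw_json = r'"The Chihuahua /t\u0283\u026a\u02c8w\u0251\u02d0w\u0251\u02d0/ (Spanish: chihuahue\u00f1o) is the smallest breed of dog"'

# json.loads turns the \uXXXX escapes into real characters before matching.
text = json.loads(raw_json)
cleaned = strip_pronunciation(text)
print(cleaned)
```

Decoding first means the remaining text contains a real ñ rather than the six characters \u00f1, which is probably what a text-to-speech engine expects. One caveat: the pattern removes any whitespace-free run between two slashes, so ordinary text such as "either/or/both" would lose its middle part as well.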
Do you want to remove only /tʃɪˈwɑːwɑː/ or also (Spanish: chihuahueño), since it uses Unicode for the ñ?