I'm using Wikipedia's API to get a simple JSON object containing the first paragraph of a wiki page, which I later want to read to the user using text-to-speech. However, some articles include a special transcription of the proper pronunciation. For example, when I follow the link for Chihuahua, the text in the JSON comes out like this: "The Chihuahua /t\u0283\u026a\u02c8w\u0251\u02d0w\u0251\u02d0/ (Spanish: chihuahue\u00f1o) is the smallest breed of dog". My question is: what is the regular expression that would remove the pronunciation part (and maybe remove any Unicode special chars: \u and the 4 chars after it)?

Trying re.sub("\/.+\/", "", test) just adds another \ in front of every existing \.

asked Dec 17, 2016 at 17:18
  • What language/regex flavor are you using here? Also, when you say "remove the pronunciation part", do you only mean removing /tʃɪˈwɑːwɑː/, or also (Spanish: chihuahueño), since it uses Unicode for the eñe? Commented Dec 17, 2016 at 17:37
  • Well, removing the pronunciation is step one. And yes, using Python. Commented Dec 17, 2016 at 18:10

1 Answer

(Assuming for now that you're using Python because you used re.sub, and you only want to remove /tʃɪˈwɑːwɑː/ because of your example regex.)

First, use Python's raw string notation for regular expression patterns, because Python string literals use backslashes for their own escape sequences (source): put an r in front of the string literal for your regular expression, and your original example might be sufficient.

Anyway, you're on the right track - Unicode doesn't require any special handling for your example case here. You just need to remove everything between the two slashes. I'd also disallow whitespace between the slashes, so that you don't match everything between two unrelated slashes that are far apart in the document. The following works for me in the Python 2.7.12 REPL:

>>> re.sub(r'\/[^/\s]+\/\s*', '', "The Chihuahua /t\u0283\u026a\u02c8w\u0251\u02d0w\u0251\u02d0/ (Spanish: chihuahue\u00f1o) is the smallest breed of dog")
'The Chihuahua (Spanish: chihuahue\\u00f1o) is the smallest breed of dog'
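
The doubled backslash in that output is only how the REPL displays a literal backslash: in a Python 2 byte string, \u00f1 is not interpreted as an escape, so the text really contains a backslash followed by u00f1. A minimal sketch, assuming Python 3 and assuming the text arrives as a raw JSON string (the payload and its "extract" key below are made up for illustration), decodes the JSON first so the escapes become real characters, and then applies the same pattern:

```python
import json
import re

# Hypothetical payload: the "extract" key is an assumption for illustration,
# not a claim about the exact shape of the API response.
raw = r'{"extract": "The Chihuahua /t\u0283\u026a\u02c8w\u0251\u02d0w\u0251\u02d0/ (Spanish: chihuahue\u00f1o) is the smallest breed of dog"}'

# json.loads turns the \uXXXX escapes into actual Unicode characters.
text = json.loads(raw)["extract"]

# Same pattern as above; "/" needs no escaping in a Python regex.
cleaned = re.sub(r'/[^/\s]+/\s*', '', text)
print(cleaned)
```

Once the JSON is decoded there are no \u sequences left to strip, so the second part of the question may not need a regex at all.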

Here's that regular expression broken down:

\/ # Match opening slash on the pronunciation expression
[^ # Begin a negated character set
 / # Exclude the forward-slash /
 \s # Also exclude all whitespace
]+ # Match one or more characters that are not a slash or whitespace
\/ # Match closing slash on the pronunciation expression
\s* # Match any whitespace that follows, too
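
The same breakdown can be kept next to the pattern by compiling it with re.VERBOSE. This is only a sketch: strip_pronunciation is a hypothetical helper name, and the second input string is an invented test case, not something from the question.

```python
import re

# Verbose form of the same pattern. Whitespace and "#" comments in the
# pattern are ignored, except inside the character class.
PRONUNCIATION = re.compile(r"""
    /          # opening slash of the pronunciation
    [^/\s]+    # one or more characters that are neither a slash nor whitespace
    /          # closing slash of the pronunciation
    \s*        # any whitespace that follows
""", re.VERBOSE)


def strip_pronunciation(text):
    """Remove slash-delimited spans that contain no whitespace (hypothetical helper)."""
    return PRONUNCIATION.sub("", text)


sample = ("The Chihuahua /t\u0283\u026a\u02c8w\u0251\u02d0w\u0251\u02d0/ "
          "(Spanish: chihuahue\u00f1o) is the smallest breed of dog")
print(strip_pronunciation(sample))

# Slashes separated by whitespace are left alone.
print(strip_pronunciation("Use and/or pick a/b testing"))
```

Note that the pattern still matches any whitespace-free span between two slashes, so text such as a/b/c would lose its middle part; if that can occur in your articles, the pattern would need to be tightened.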
answered Dec 17, 2016 at 18:00