2

I have some data stored as a pandas DataFrame, and one of the columns contains text strings in Korean. I would like to process each of these strings. For example, I want to turn this:

my_string = '모질상태불량(피부상태불량, 심하게 야윔), 치석심함, 양측 수정체 백탁, 좌측 화농성 눈곱심함(7/22), 코로나음성(활력저하)'

into a comma-separated string like this:

parsed_text = '모질상태불량, 피부상태불량, 심하게 야윔, 치석심함, 양측 수정체 백탁, 좌측 화농성 눈곱심함(7/22), 코로나음성, 활력저하'

So the problem is to identify cases where a word (or several words) is followed by parentheses containing text only (one word, or several words separated by commas) and to replace each such case with all of the words (those before and those inside the parentheses), separated by commas, for later processing. If a word is followed by parentheses containing numbers (in this case, 7/22), it should be kept as it is. If a word is not followed by any parentheses, it should also be kept as it is. Furthermore, I would like to preserve the order of the words as they appeared in the original string.

I can extract text in parentheses by using regex as follows:

corrected_string = re.findall(r'(\w+)\((\D.*?)\)', my_string)

which yields this:

[('모질상태불량', '피부상태불량, 심하게 야윔'), ('코로나음성', '활력저하')] 

But I'm having difficulty creating my resulting string, i.e. replacing my original text with the pattern I've matched. Any suggestions? Thank you.

VLAZ
asked Jan 25, 2019 at 10:08
3
  • Try this approach: define rx = r'(\w+\([\d/]*\))|(\()|\)' and a callback repl(m) that returns m.group(1) if group 1 matched, ", " if group 2 matched, and "" otherwise, then call re.sub(rx, repl, s). Commented Jan 25, 2019 at 10:16
  • Why not answer instead? Commented Jan 25, 2019 at 10:22
  • @Rahul Because it follows a different logic. If it works, I will post. Commented Jan 25, 2019 at 10:26
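The regex and callback from the first comment can be written out as a small runnable sketch. The pattern and the repl callback come from that comment; the wrapper name flatten_parens is only an illustrative choice, and the behavior is checked here against the example string from the question rather than against arbitrary input.

```python
import re

# Group 1: a word followed by parentheses holding only digits and slashes (kept as is).
# Group 2: any other opening parenthesis (becomes ", ").
# No group: any other closing parenthesis (dropped).
RX = re.compile(r'(\w+\([\d/]*\))|(\()|\)')

def repl(m):
    if m.group(1):
        return m.group(1)
    if m.group(2):
        return ", "
    return ""

def flatten_parens(s):
    # Hypothetical helper name; it simply applies the substitution.
    return RX.sub(repl, s)

my_string = '모질상태불량(피부상태불량, 심하게 야윔), 치석심함, 양측 수정체 백탁, 좌측 화농성 눈곱심함(7/22), 코로나음성(활력저하)'
parsed_text = flatten_parens(my_string)
print(parsed_text)
```

Because this works on the whole string with re.sub, the result is already a single comma-separated string and needs no joining step.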

2 Answers

1

You can use re.findall with a pattern that matches each run of characters other than commas and parentheses, optionally followed by a parenthesized group that contains a digit:

corrected_string = re.findall(r'[^,()]+(?:\([^)]*\d[^)]*\))?', my_string)
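Note that re.findall returns a list, and the matched items keep the space that follows each comma. A minimal sketch of turning that list into the requested string, assuming leading and trailing whitespace should be stripped (split_terms is a hypothetical helper name, not part of the answer):

```python
import re

# Same pattern as above: a run of non-comma, non-parenthesis characters,
# optionally followed by a parenthesized group that contains a digit.
PATTERN = re.compile(r'[^,()]+(?:\([^)]*\d[^)]*\))?')

def split_terms(s):
    # Strip the whitespace left over after commas and skip blank matches.
    return [t.strip() for t in PATTERN.findall(s) if t.strip()]

my_string = '모질상태불량(피부상태불량, 심하게 야윔), 치석심함, 양측 수정체 백탁, 좌측 화농성 눈곱심함(7/22), 코로나음성(활력저하)'
terms = split_terms(my_string)
parsed_text = ', '.join(terms)
print(parsed_text)
```

For a DataFrame column, the same function could be applied row by row with something like df['col'].apply(split_terms), where df and 'col' stand in for the actual frame and column name.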
answered Jan 25, 2019 at 10:26

1 Comment

This one is perfect! Thank you.
1

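It's a little bit clumsy, but you can try splitting on every comma, on every opening parenthesis not followed by a digit, and on every closing parenthesis not preceded by a digit: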

my_string_list = [x.strip() for x in re.split(r"\((?!\d)|(?<!\d)\)|,", my_string) if x]
# You can then join the list back into a string, e.g. with ', '.join(my_string_list).
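A self-contained sketch of the split approach, assuming the example string from the question. The only change from the line above is that the filter tests the stripped piece, so whitespace-only fragments would also be dropped; split_on_delimiters is an illustrative name.

```python
import re

# Split on "(" not followed by a digit, ")" not preceded by a digit, or ",".
SPLIT_RX = re.compile(r"\((?!\d)|(?<!\d)\)|,")

def split_on_delimiters(s):
    # Filter on the stripped piece so empty and whitespace-only fragments are skipped.
    return [x.strip() for x in SPLIT_RX.split(s) if x.strip()]

my_string = '모질상태불량(피부상태불량, 심하게 야윔), 치석심함, 양측 수정체 백탁, 좌측 화농성 눈곱심함(7/22), 코로나음성(활력저하)'
my_string_list = split_on_delimiters(my_string)
parsed_text = ', '.join(my_string_list)
print(parsed_text)
```

The lookarounds only inspect the single character next to each parenthesis, so a group such as (7월 22일) that starts with a digit but ends with a letter would be split at its closing parenthesis; inputs like that may need a different pattern.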
answered Jan 25, 2019 at 10:16

3 Comments

Thanks. Works great! I appreciate it. Just one thing: when you look at the 3rd word in the result list, it left the right parenthesis in.
This is just for your understanding. You need to work from here. Also see Wiktor Stribiżew's approach.
@Rahul Thanks, Rahul.
