I have a string that was decoded as UTF-8 but still contains invalid Unicode characters.

string = '칼 마르크스 「자본론\udb82\udc55I, 김수행 역 비봉출판사 108쪽―이하에서는 「자본론\udb82\udc55의 권수와 쪽수만 표기함―역자'

Is there a way to remove any literal unicode character using regex?

I need to remove those literal Unicode characters, not decode them into another form.


I can only remove one of them if I spell out that exact escape in the pattern; I have not found a pattern that removes any such character.

import re
re.sub('\udb82', '', string)

'칼 마르크스 「자본론\udc55I, 김수행 역 비봉출판사 108쪽―이하에서는 「자본론\udc55의 권수와 쪽수만 표기함―역자'


I know it is possible to replace the literal unicode character by using encode and decode, but I am looking for alternatives that can remove any literal unicode character directly.

string.encode('utf-8', 'replace').decode('utf-8')

'칼 마르크스 「자본론??I, 김수행 역 비봉출판사 108쪽―이하에서는 「자본론??의 권수와 쪽수만 표기함―역자'
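For context, a minimal sketch of what the `errors` argument is doing in that round trip. The helper name `roundtrip` and the short sample string are made up for illustration; the behaviour shown is that of Python 3's standard `str.encode`.

```python
def roundtrip(text, errors='strict'):
    """Encode to UTF-8 with the given error handler, then decode back."""
    return text.encode('utf-8', errors).decode('utf-8')

# Hypothetical short sample with the same two surrogate code points.
sample = 'A\udb82\udc55B'

# The default handler, 'strict', refuses to encode surrogates at all.
try:
    roundtrip(sample)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'replace' substitutes one '?' for each code point that cannot be encoded.
replaced = roundtrip(sample, 'replace')
```

This is why the output above shows `??` where the two escapes used to be: the characters are substituted, not removed.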

asked Jul 6, 2020 at 3:44
  • The marked question does not solve my problem. I do not want to decode it into another form, I need to remove those literal unicode from the string. Commented Jul 6, 2020 at 4:05
  • Then please refer to this thread. Commented Jul 6, 2020 at 4:32
  • Something like this - regex101.com/r/n7nRXq/1 ? Commented Jul 6, 2020 at 4:45
  • It is working. Please see here. The code will work now for both python 2 and 3. Notice the shebang line at the top for python 2 compatibility. Commented Jul 6, 2020 at 5:20
  • I'm glad it worked. Cheers =) Commented Jul 6, 2020 at 5:29

1 Answer

You may not need to fiddle around with regular expressions at all; you could instead go for:

string = '칼 마르크스 「자본론\udb82\udc55I, 김수행 역 비봉출판사 108쪽―이하에서는 「자본론\udb82\udc55의 권수와 쪽수만 표기함―역자'
print(string.encode('utf-8', 'ignore').decode('utf-8'))

Which yields

칼 마르크스 「자본론I, 김수행 역 비봉출판사 108쪽―이하에서는 「자본론의 권수와 쪽수만 표기함―역자
# ^^^ - it's gone!
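
If a regex is still preferred, as the question asked, one possible sketch is a character class covering the whole surrogate range U+D800 to U+DFFF. This assumes the unwanted characters are all surrogate code points, as they are in the sample string; the names `SURROGATES` and `strip_surrogates` are made up for illustration.

```python
import re

# Matches any single code point in the surrogate range U+D800..U+DFFF.
# Assumption: every character to be removed is a surrogate, as in the sample.
SURROGATES = re.compile(r'[\ud800-\udfff]')

def strip_surrogates(text):
    """Return text with every surrogate code point removed."""
    return SURROGATES.sub('', text)

string = '칼 마르크스 「자본론\udb82\udc55I, 김수행 역 비봉출판사 108쪽'
cleaned = strip_surrogates(string)
```

Unlike `re.sub('\udb82', '', string)`, this does not need each escape spelled out, and it leaves the rest of the text untouched.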
answered Jul 6, 2020 at 5:42