Remove literal invalid unicode characters in a string

I have a string that was decoded as UTF-8 but still contains invalid Unicode characters (surrogate code points such as \udb82 and \udc55).

string = '칼 마르크스 「자본론\udb82\udc55I, 김수행 역 비봉출판사 108쪽―이하에서는 「자본론\udb82\udc55의 권수와 쪽수만 표기함―역자'

Is there a way to remove any such character using a regex? I need to remove these characters outright, not decode them into another form.

So far I can only remove a character when I spell out its exact code point; I have not found a pattern that matches any of them:

import re
re.sub('\udb82', '', string)

'칼 마르크스 「자본론\udc55I, 김수행 역 비봉출판사 108쪽―이하에서는 「자본론\udc55의 권수와 쪽수만 표기함―역자'

I know I can replace the invalid characters by using encode and decode, but I am looking for an alternative that removes them directly.

string.encode('utf-8', 'replace').decode('utf-8')

'칼 마르크스 「자본론??I, 김수행 역 비봉출판사 108쪽―이하에서는 「자본론??의 권수와 쪽수만 표기함―역자'

edited Jul 6, 2020 at 4:08

asked Jul 6, 2020 at 3:44

cylim

The marked question does not solve my problem. I do not want to decode the string into another form; I need to remove those literal Unicode characters from it.

– cylim, Jul 6, 2020 at 4:05

Then please refer to this thread.

– metatoaster, Jul 6, 2020 at 4:32

Something like this - regex101.com/r/n7nRXq/1 ?

– Jan, Jul 6, 2020 at 4:45

It is working. Please see here. The code will now work for both Python 2 and 3. Notice the shebang line at the top for Python 2 compatibility.

– user7571182, Jul 6, 2020 at 5:20

I'm glad it worked. Cheers =)

– user7571182, Jul 6, 2020 at 5:29

1 Answer

You don't actually need to fiddle with regular expressions here; you can simply encode with the 'ignore' error handler and decode again:

string = '칼 마르크스 「자본론\udb82\udc55I, 김수행 역 비봉출판사 108쪽―이하에서는 「자본론\udb82\udc55의 권수와 쪽수만 표기함―역자'
print(string.encode('utf-8', 'ignore').decode('utf-8'))

Which yields

칼 마르크스 「자본론I, 김수행 역 비봉출판사 108쪽―이하에서는 「자본론의 권수와 쪽수만 표기함―역자
# ^^^ - it's gone!
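
If a regex is still preferred, as the question asks, the following is a minimal sketch. It assumes Python 3 and assumes that the unwanted characters are all surrogate code points (U+D800 to U+DFFF), which is the case for \udb82 and \udc55 in the example. The names SURROGATES and strip_surrogates are made up for illustration and are not part of any library.

```python
import re

# Character class covering the whole surrogate range U+D800..U+DFFF.
# Assumption: every "invalid" character in the input is a surrogate.
SURROGATES = re.compile('[\ud800-\udfff]')

def strip_surrogates(text):
    """Return text with every surrogate code point removed (hypothetical helper)."""
    return SURROGATES.sub('', text)

dirty = '자본론\udb82\udc55I'
clean = strip_surrogates(dirty)

# The regex result should match the encode/decode round trip shown above.
print(clean == dirty.encode('utf-8', 'ignore').decode('utf-8'))
```

Unlike the pattern in the question, which names a single code point, the character class matches any surrogate, so both halves of each pair are removed in one pass.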

answered Jul 6, 2020 at 5:42

Jan