I have a string that was decoded as UTF-8 but contains invalid Unicode characters.
string = '칼 마르크스 「자본론\udb82\udc55I, 김수행 역 비봉출판사 108쪽―이하에서는 「자본론\udb82\udc55의 권수와 쪽수만 표기함―역자'
Is there a way to remove any such literal Unicode character using a regex?
I need to remove these characters, not decode them into another form.
I can remove one only if I spell out the exact character in the pattern; I have not found a pattern that matches any of them.
re.sub('\udb82', '', string )
'칼 마르크스 「자본론\udc55I, 김수행 역 비봉출판사 108쪽―이하에서는 「자본론\udc55의 권수와 쪽수만 표기함―역자'
I know the characters can be replaced by using encode and decode, but I am looking for an alternative that removes any such character directly.
string.encode('utf-8', 'replace').decode('utf-8')
'칼 마르크스 「자본론??I, 김수행 역 비봉출판사 108쪽―이하에서는 「자본론??의 권수와 쪽수만 표기함―역자'
- The marked question does not solve my problem. I do not want to decode the string into another form; I need to remove those literal Unicode characters from it. (cylim, Jul 6, 2020 at 4:05)
- Then please refer to this thread. (metatoaster, Jul 6, 2020 at 4:32)
- Something like this - regex101.com/r/n7nRXq/1 ? (Jan, Jul 6, 2020 at 4:45)
- It is working. Please see here. The code will work now for both Python 2 and 3. Notice the shebang line at the top for Python 2 compatibility. (user7571182, Jul 6, 2020 at 5:20)
- I'm glad it worked. Cheers =) (user7571182, Jul 6, 2020 at 5:29)
1 Answer
You may not need regular expressions at all. Encoding with the 'ignore' error handler drops the characters that cannot be encoded, and decoding gives back a clean string:
string = '칼 마르크스 「자본론\udb82\udc55I, 김수행 역 비봉출판사 108쪽―이하에서는 「자본론\udb82\udc55의 권수와 쪽수만 표기함―역자'
print(string.encode('utf-8', 'ignore').decode('utf-8'))
Which yields
칼 마르크스 「자본론I, 김수행 역 비봉출판사 108쪽―이하에서는 「자본론의 권수와 쪽수만 표기함―역자
# ^^^ - it's gone!
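If a regex is still preferred, a character class covering the whole surrogate range (U+D800 to U+DFFF) should match any of these characters rather than one specific code point. The following is a minimal sketch for Python 3, assuming the unwanted characters are all lone surrogates like the ones in the question; the names SURROGATES and strip_surrogates are made up for illustration, and the sample string is shortened.

```python
import re

# Matches any single surrogate code point, U+D800 through U+DFFF.
# (Assumption: the invalid characters are all surrogates, as in the question.)
SURROGATES = re.compile(r'[\ud800-\udfff]')


def strip_surrogates(text):
    """Return text with every surrogate code point removed (hypothetical helper)."""
    return SURROGATES.sub('', text)


string = '「자본론\udb82\udc55I, 김수행 역'
cleaned = strip_surrogates(string)

# The result should match what the encode/decode round trip produces.
print(cleaned == string.encode('utf-8', 'ignore').decode('utf-8'))  # True
```

Unlike the round trip through bytes, this only touches surrogates and leaves every other character alone, which may be closer to "removing them directly".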