I downloaded a dataset of Facebook messages and it was formatted like this:
f\u00c3\u00b8rste student
It's supposed to be første student, but I can't seem to decode it correctly.
I tried:
s = 'f\u00c3\u00b8rste student'  # avoid naming the variable str, which shadows the built-in
print(s)
# 'fÃ¸rste student'
s = 'f\u00c3\u00b8rste student'
print(s.encode('utf-8'))
# b'f\xc3\x83\xc2\xb8rste student'
But it didn't work.
1 Answer
To undo whatever encoding foul-up took place, first convert the characters to the bytes with the same ordinals by encoding as ISO-8859-1 (Latin-1), then decode those bytes as UTF-8:
>>> 'f\u00c3\u00b8rste student'.encode('iso-8859-1').decode('utf-8')
'første student'
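As an illustrative sketch (not part of the original answer), the same round trip can be wrapped in a small function that leaves the text untouched when it does not look like this kind of mojibake. The name fix_mojibake is hypothetical, and the fallback behaviour is an assumption about what you would want for a mixed dataset:

```python
def fix_mojibake(text: str) -> str:
    """Hypothetical helper: undo UTF-8 bytes that were mis-decoded as Latin-1."""
    try:
        # Map each character back to the byte with the same ordinal (0-255),
        # then reinterpret that byte sequence as UTF-8.
        return text.encode('iso-8859-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text contains characters outside Latin-1, or the bytes
        # are not valid UTF-8. Assume it was not mangled this way (an
        # assumption of this sketch) and return it unchanged.
        return text


broken = 'f\u00c3\u00b8rste student'
print(broken.encode('iso-8859-1'))      # b'f\xc3\xb8rste student'
print(fix_mojibake(broken))             # første student
print(fix_mojibake('første student'))   # første student (already correct, unchanged)
```

Note that a correctly decoded string which happens to form valid UTF-8 when re-encoded as Latin-1 would still be altered, so it is worth spot-checking the results on real data.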
answered Dec 3, 2018 at 22:16
jwodder
Comments
'ø' is '\u00f8'. The line # -*- coding: utf-8 -*- only specifies the file encoding of the source code.
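The relationship between 'ø', its code point, and the mangled form in the question can be checked directly. This is a minimal sketch and assumes the data was produced by decoding UTF-8 bytes as Latin-1, which is consistent with the sample but not confirmed by the source:

```python
char = 'ø'
assert char == '\u00f8'          # the literal and the escape are the same character

utf8_bytes = char.encode('utf-8')
print(utf8_bytes)                # b'\xc3\xb8'

# Decoding those two bytes as Latin-1 gives one character per byte,
# which reproduces the sequence seen in the dataset.
mangled = utf8_bytes.decode('iso-8859-1')
print(mangled == '\u00c3\u00b8')  # True
```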