Backpropagation of wrongly (double) encoded CSV

Question 1

I have a CSV file that someone encoded wrongly.

The file is a database of movies with their corresponding actors. I downloaded it to practise some coding for the so-called Bacon number.

It looks like this:

movieId,title,actors
(...)
61,Eye for an Eye (1996),(a ton of other actors)|Dolores VelÌÁzquez|(more actors)
59,The Confessional (1995),(a ton of other actors)|Richard FrÌ©chette|FranÌ¤ois Papineau|Marie Gignac|Normand Daneau|Anne-Marie Cadieux|Suzanne ClÌ©ment|Lynda Beaulieu|Pascal Rollin|Billy Merasty|Paul HÌ©bert|Marthe Turgeon|Adreanne Lepage-Beaulieu|AndrÌ©e-Anne ThÌ©roux-Faille|Rodrigue Proteau|Philippe Paquin|Pierre HÌ©bert|Nathalie D'Anjou|Danielle Fichaud|Jules Philip|Jacques Laroche|Claude-Nicolas Demers|Jean-Philippe CÌ«tÌ©|Tristan Wiseman|Marc-Olivier Tremblay|Jacques Brouillet|Jean-Paul L'Allier|Denis Bernard|RenÌ©e Hudon|Serge Laflamme|Carl Mathieu
(...)

As you can see, where there should be umlauts and accented letters (ÄÖÜ, É, À, Û etc.), the actors' names contain combinations of other special characters.

Thanks to very helpful input on this question, we have established that this is indeed a case of Mojibake.

My goal is to programmatically fix the broken characters by decoding and encoding in the correct order.

Question 2

Can you provide a clear list of the affected characters: what they are now and what they should be in UTF-8 (with hex codes to avoid any ambiguity)? I expect that with a dozen or so concrete examples someone will be able to spot the pattern. Undoing it programmatically might be ambiguous if some mappings are many-to-one.
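A minimal sketch of how such a list could be produced. The garbled pairs are copied from the sample rows; the helper name `describe` is hypothetical, and the use of cp1252 to recover the bytes is an assumption about how the file was last read.

```python
# Garbled pairs copied from the sample rows (e.g. "FrÌ©chette", "VelÌÁzquez").
SAMPLES = ["Ì©", "Ì¤", "ÌÁ", "Ì«"]


def describe(text):
    """Return (Unicode code points, cp1252 bytes in hex) for a garbled sequence."""
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    # Assumption: the garbled text was last decoded as cp1252, so encoding it
    # back with cp1252 recovers the bytes that are actually stored.
    raw_hex = text.encode("cp1252").hex(" ")
    return code_points, raw_hex


for sample in SAMPLES:
    points, raw = describe(sample)
    print(f"{points}  ->  {raw}")
```

Listing code points next to bytes makes it easier to compare the data against published code page tables without depending on how a terminal renders the characters.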

Question 3

Attempting to re-imagine the "usual" mishandlings didn't produce any combination that yields this specific output, but it certainly looks like a traditional "UTF-8 bytes were wrongly decoded with an 8-bit encoding" problem. I just don't know which 8-bit encoding.
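One way to search for the unknown encoding is to brute-force short encode/decode chains over a list of candidate codecs. The sketch below is only an illustration: the candidate list, the function name `find_chains`, and the expected spelling 'Fréchette' are assumptions, and a hit on one name does not prove the chain is right for the whole file.

```python
from itertools import product

# Assumed shortlist of single-byte codecs; extend as needed.
CANDIDATES = ["cp1252", "latin_1", "mac_roman", "mac_romanian", "cp437", "cp850"]


def find_chains(garbled, expected, codecs=CANDIDATES):
    """Try every encode -> decode -> encode chain, then decode as UTF-8."""
    hits = []
    for enc1, dec1, enc2 in product(codecs, repeat=3):
        try:
            repaired = garbled.encode(enc1).decode(dec1).encode(enc2).decode("utf-8")
        except UnicodeError:
            # This combination cannot represent the data at some step.
            continue
        if repaired == expected:
            hits.append((enc1, dec1, enc2))
    return hits


for chain in find_chains("FrÌ©chette", "Fréchette"):
    print(chain)
```

Several chains can match a single sample because some codecs agree on the bytes involved, so it is worth checking the candidates against more than one name.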

Question 4

Thanks, @MartinBrown, I added some more examples in the post.

Question 5

It looks like the 8-bit encoding Joachim mentions displays Ì for byte 0xc3. Unfortunately, none of the 8-bit encodings I am familiar with matches here...

Question 6

You face a double mojibake case (example in Python): 'FrÌ©chette|FranÌ¤ois'.encode('cp1252').decode('mac-romanian').encode('cp1252').decode('utf-8') returns 'Fréchette|François'. Please edit your question to improve your minimal reproducible example. In particular, share where the CSV file comes from (i.e. how it was created), as well as how you read it...
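To see what each step of that chain does, the intermediate values can be collected one by one. This is a small sketch of the same four calls; the function name `trace_repair` is hypothetical.

```python
def trace_repair(text, middle_codec="mac_romanian"):
    """Return the value after each of the four steps in the repair chain."""
    stored_bytes = text.encode("cp1252")           # undo the last wrong decode
    first_level = stored_bytes.decode(middle_codec)  # classic "Ã©"-style mojibake
    utf8_bytes = first_level.encode("cp1252")      # undo the first wrong decode
    repaired = utf8_bytes.decode("utf-8")          # read the bytes as intended
    return [stored_bytes, first_level, utf8_bytes, repaired]
```

Calling `trace_repair('FrÌ©chette')` yields the bytes `b'Fr\xcc\xa9chette'`, the single-level mojibake 'FrÃ©chette', the UTF-8 bytes `b'Fr\xc3\xa9chette'`, and finally 'Fréchette'.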

LuHo 15 bronze badges · Accepted Answer · 2024-03-08 11:40:14Z

This is what comes closest to the solution (provided by JosefZ):

You face a double mojibake case (example in Python): 'FrÌ©chette|FranÌ¤ois'.encode('cp1252').decode('mac-romanian').encode('cp1252').decode('utf-8') returns 'Fréchette|François'.

Thanks to very helpful input on this question, we have established that this is indeed a case of Mojibake.

I have made progress with the following configuration (Python):

Read the current CSV without specifying an encoding
encode('cp1252').decode('mac-roman').encode('cp1252').decode('iso-8859-16')
Write the result to a new CSV with encoding 'iso-8859-16' specified
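The three steps above can be sketched as a small script. Note that this is not the exact configuration listed: it follows JosefZ's chain and decodes the last step as UTF-8, then writes UTF-8. The function names, the `src_encoding` default, and the choice to split the actors field on '|' are assumptions made for illustration.

```python
import csv


def repair_field(text):
    """Undo the double mojibake; return the text unchanged if the chain fails."""
    try:
        return text.encode("cp1252").decode("mac_roman").encode("cp1252").decode("utf-8")
    except UnicodeError:
        # Lossy sequences such as 'Ì_' cannot be decoded; keep them for manual review.
        return text


def repair_cell(cell):
    """Repair each '|'-separated name on its own so one bad name does not block the rest."""
    return "|".join(repair_field(part) for part in cell.split("|"))


def repair_csv(src_path, dst_path, src_encoding="utf-8"):
    """Copy a CSV file, repairing every cell. src_encoding is an assumption."""
    with open(src_path, newline="", encoding=src_encoding) as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow([repair_cell(cell) for cell in row])
```

Repairing name by name means a row is only left partly broken where information is actually missing, which keeps the list of rows needing manual work as short as possible.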

With this configuration, many characters are fixed now, but some are still broken. I don't know whether this means I have to decode and encode again (three times in total), or whether I simply haven't found the correct combination for two decode/encode rounds yet.

Here is a list of characters that are still broken after the above re-encode:

after re-encoding | just reading | desired outcome | notes
==============================================================================
�                 | Ì            | Á               | the lower case á works
�_                | Ì_           | ü               | the upper case Ü works
�_                | Ì_           | ä               | I can't confirm whether upper case Ä works
                  |              | æ               | I can't confirm whether this exists, but I confirmed one upper case that works; it's possible the file uses 'ae' instead
                  |              | œ               | I can't confirm whether any of this exists; it's possible the file uses 'oe' instead
�_                | Ì_           | í               | I can't confirm whether upper case Í works
�_                | Ì_           | ó               | the upper case Ó works
�_                | Ì_           | þ               | the upper case Þ doesn't exist, I think

Conclusion:

As you can see, reading the file as is and reading it after re-encoding both render all of the missing characters as the same broken sequence, so it seems impossible to restore the original information for them (it was lost in the process of Mojibake). I'm afraid those lines will have to be fixed manually.
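A possible explanation for the loss, offered as an inference rather than an established fact: if the corruption was the reverse of the repair chain (UTF-8 bytes read as cp1252, written as Mac Roman, read again as cp1252), then some characters have no representation at one of those steps and could only be stored as a placeholder. The sketch below replays that assumed chain; the function name `corrupt` is hypothetical.

```python
def corrupt(ch):
    """Replay the assumed corruption; return None where a step cannot represent the data."""
    try:
        return ch.encode("utf-8").decode("cp1252").encode("mac_roman").decode("cp1252")
    except UnicodeError:
        # Either a UTF-8 byte is undefined in cp1252, or the resulting
        # character has no slot in Mac Roman.
        return None


for ch in "éáÜÓÁüíóþ":
    status = "survives" if corrupt(ch) is not None else "lost"
    print(f"U+{ord(ch):04X} {status}")
```

Under this assumption é, á, Ü and Ó survive while Á, ü, í, ó and þ do not, which agrees with the notes in the table and supports the view that the affected names cannot be recovered from the file alone.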
