This is what comes closest to the solution (provided by JosefZ):
You face a double mojibake case (example in Python): 'FrÌ©chette|FranÌ¤ois'.encode('cp1252').decode('mac-romanian').encode('cp1252').decode('utf-8') returns 'Fréchette|François'.
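To sanity-check that diagnosis, here is a small self-contained sketch (my own, not part of the quoted answer): it first manufactures the damage from the clean sample, assuming the UTF-8 bytes were read as cp1252 and then stored as mac-romanian and read back as cp1252, and then undoes it with the quoted chain.

    # Sketch: reproduce and undo the double mojibake on the sample string.
    clean = 'Fréchette|François'

    # Forward (assumed damage path): UTF-8 bytes read as cp1252,
    # then stored as mac-romanian and read back as cp1252.
    garbled = (clean.encode('utf-8').decode('cp1252')
                    .encode('mac-romanian').decode('cp1252'))
    print(garbled)    # FrÌ©chette|FranÌ¤ois

    # Reverse: the chain quoted from the answer above.
    restored = (garbled.encode('cp1252').decode('mac-romanian')
                       .encode('cp1252').decode('utf-8'))
    assert restored == clean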
Thanks to very helpful input on this question, we have established that this is indeed a case of Mojibake.
I have made progress with the following configuration (Python; a sketch follows the list):
- Read the current CSV with no encoding specified
- encode('cp1252').decode('mac-roman').encode('cp1252').decode('iso-8859-16')
- Write the result to a new CSV with encoding 'iso-8859-16' specified
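Roughly, that round trip looks like this in code (the file names are placeholders, and the pipe delimiter is taken from the sample above):

    import csv

    # Step 1: read the current file with the encoding left unspecified
    # (i.e. the platform default), as described above.
    with open('input.csv', newline='') as src:          # placeholder path
        rows = list(csv.reader(src, delimiter='|'))

    def fix(text):
        # Step 2: the encode/decode chain; errors='replace' keeps any
        # unmappable characters visible instead of raising.
        return (text.encode('cp1252', errors='replace')
                    .decode('mac-roman')
                    .encode('cp1252', errors='replace')
                    .decode('iso-8859-16'))

    fixed_rows = [[fix(cell) for cell in row] for row in rows]

    # Step 3: write the result with 'iso-8859-16' specified.
    with open('fixed.csv', 'w', encoding='iso-8859-16', newline='') as dst:   # placeholder path
        csv.writer(dst, delimiter='|').writerows(fixed_rows)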
With this configuration, many characters are now fixed, but some are still missing. I don't know whether this means I have to decode and encode yet again (for a total of three passes), or whether I simply haven't found the right combination for a two-step decode/encode chain yet.
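One way I have been trying to narrow this down (a rough sketch: keep the cp1252 steps fixed as established above, and brute-force the two remaining codec choices against a known broken/desired pair from the table below):

    # Diagnostic sketch: the candidate list and the sample pair are placeholders;
    # substitute a real broken/desired pair from the data.
    candidates = ['mac-roman', 'mac-romanian', 'mac-latin2', 'mac-iceland',
                  'iso-8859-1', 'iso-8859-2', 'iso-8859-15', 'iso-8859-16',
                  'cp1250', 'cp1252', 'utf-8']
    broken, desired = 'Ì©', 'é'   # placeholder pair

    for middle in candidates:
        for final in candidates:
            try:
                out = broken.encode('cp1252').decode(middle).encode('cp1252').decode(final)
            except UnicodeError:
                continue
            if out == desired:
                print(f"decode({middle!r}) ... decode({final!r}) -> {out}")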
Here is a list of the characters that are still broken after the re-encoding above:
after re-encoding | just reading | desired outcome | notes
==============================================================================
�                 | Ì            | Á               | the lower case á works
�_                | Ì_           | ü               | the upper case Ü works
�_                | Ì_           | ä               | I can't confirm whether upper case Ä works
                  |              | æ               | I can't confirm whether this exists, but I confirmed one upper case that works; it's possible the file uses 'ae' instead
                  |              | œ               | I can't confirm whether any of this exists; it's possible the file uses 'oe' instead
�_                | Ì_           | í               | I can't confirm whether upper case Í works
�_                | Ì_           | ó               | the upper case Ó works
�_                | Ì_           | þ               | the upper case Þ doesn't exist, I think
Conclusion:
As you can see, both reading the file 'as is' and reading it after re-encoding render all missing characters as the same broken character, so it seems impossible to restore the original information for those (it was lost in the Mojibake process). I'm afraid those lines will have to be fixed manually.