UTF-8 decoding library

Question 1

I have to work on an application that uses Unicode (UTF-8) on Windows, with MSVC 10. As far as I'm aware, UTF-8 encoded strings use either 1 or 2 bytes per character. So my question is: is std::string suitable for this? If yes, how do I decode the strings? As far as I understand, std::string is just an array of bytes and doesn't provide any decoding logic. How can I know the logical length of the string? How can I extract logical characters from a string? Are there any libraries which help me extract logical characters from the string?

E.g.: if I have the string "olé" in a std::string, I need to know that the length is 3, not 4.

Question 2

UTF-8 can use up to 4 bytes per character, not just one or two.
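
To make the 1-to-4-byte rule concrete, here is a minimal sketch that reads the sequence length off the first byte. The function name `utf8_sequence_length` is made up for illustration and does not come from any library; it only looks at the lead byte and does not validate the bytes that follow.

```cpp
#include <cstddef>

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with `lead`, or 0 if `lead` cannot start a sequence (a continuation byte
// or a value that never appears in valid UTF-8).
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;                   // 0xxxxxxx: ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  // 110xxxxx (0xC0/0xC1 would be overlong)
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  // 1110xxxx
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  // 11110xxx, up to U+10FFFF
    return 0;                                    // 0x80..0xBF, 0xC0, 0xC1, 0xF5..0xFF
}
```

For "olé", the bytes `o` and `l` each give 1, and the lead byte 0xC3 of `é` gives 2, which is where the byte length of 4 comes from.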

Question 3

What exactly do you mean by decoding the string? More importantly, why do you need to know the length? It usually makes no sense in Unicode.

Question 4

A commonly used library is ICU - International Components for Unicode

Question 5

Yes, std::string is appropriate but, as you’ve noticed, it only operates on bytes, not Unicode code points. In that sense, std::string is an opaque type; this isn’t necessarily bad (in fact, it does have some advantages, see the links below for information) but it makes it necessary to decode the string if you need information about characters.

For the actual handling of UTF-8 (where necessary), you can use the Boost.Nowide library to decode UTF-8.
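
To illustrate what "decoding" means here, the following is a minimal hand-rolled sketch that turns the bytes of a std::string into Unicode code points. It is not how any particular library does it; the name `decode_utf8` and the choice to throw `std::runtime_error` on bad input are assumptions made for this example. In real code a tested library is usually the safer choice.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: decodes UTF-8 bytes into Unicode code points.
// Throws std::runtime_error on truncated or malformed input.
std::vector<std::uint32_t> decode_utf8(const std::string& s) {
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const std::uint32_t min_cp[] = {0x0, 0x80, 0x800, 0x10000};

    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::uint32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > s.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }

        // Reject overlong encodings, surrogates, and values past U+10FFFF.
        if (cp < min_cp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point in UTF-8 input");

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

With this sketch, `decode_utf8("ol\xC3\xA9")` yields three code points, the last one being U+00E9. Note that code points are still not the same thing as user-perceived characters, because of combining characters.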

Furthermore, I suggest reading the UTF-8 everywhere manifesto for some information about the use of UTF-8 vs. other Unicode transformations.

Question 6

std::string makes it really hard to sort strings and test characters in all sorts of ways. Now, if you don't need such operations, it is indeed possible to use UTF-8 strings with std::string. You could also make use of QString (Qt) or CString (MFC...).

Question 7

Thanks @Konrad Rudolph for providing the links. They were useful... From the initial glance, it looks like the Boost.Nowide library doesn't have a stringstream... but, I need to look in detail.

Question 8

@Alexis For that you can (indeed, must) provide custom comparers. That isn’t restricted to std::string though, it’s even true when working with wide characters due to the presence of combining characters etc.

Question 9

I don't think you can get the count of Unicode code points in a string using Boost.Nowide, or can you (only if all of them are in the BMP)? I see that Boost.Nowide is useful for I/O, but it does not offer functionality for Unicode string handling otherwise.

Question 10

I found this amusing from the article: "Windows C++ programmers are educated that Unicode must be done with ‘widechars’. As a result of this mess, they are now among the most confused ones about what is the right thing to do about text". I'm one of them !!

Question 11

First you may want to call the mbstowcs() function to transform the UTF-8 characters to wide characters. Then if you want the result to be 8 bits, you'll have a loss of data in the event you have "Unicode" characters (characters outside of the ISO-8859-1 plane, also called Latin 1.)

Note that the "Windows" encoding is not 1 to 1 equivalent to ISO-8859-1, but in most cases ISO-8859-1 is what people use these days.

Reference: http://www.cplusplus.com/reference/clibrary/cstdlib/mbstowcs/

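Okay, if you just want the length in characters, you can use the mblen() function. Note that mblen() returns the number of bytes in the next multibyte character (in the current locale's encoding), not the character count of the whole string, so you have to call it repeatedly and advance through the string:

len = mblen(str.c_str(), str.length()); // byte length of the first character only

Additional note: for UTF-8, an easy way to count characters yourself is to count the bytes that are not between 0x80 and 0xBF, since those are the continuation bytes of a multi-byte sequence. This is particularly useful if you receive a UTF-8 byte sequence over a flaky serial connection.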
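
The byte-counting trick from the note above can be sketched as follows. The function name `utf8_code_point_count` is made up for this example, and the sketch assumes the input is valid UTF-8; it does no validation and counts code points, not user-perceived characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string by counting
// every byte that is NOT a continuation byte (0x80..0xBF, i.e. 10xxxxxx).
// Assumes the input is valid UTF-8.
std::size_t utf8_code_point_count(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(s[i]);
        if ((byte & 0xC0) != 0x80) {
            ++count;  // ASCII byte or lead byte of a multi-byte sequence
        }
    }
    return count;
}
```

For the string from the question, `utf8_code_point_count("ol\xC3\xA9")` gives 3 while `size()` gives 4. If the é is written as `e` followed by the combining acute accent U+0301 (`"ole\xCC\x81"`), the count becomes 4, which is one reason "length" is a slippery notion in Unicode.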

Question 12

"in most cases ISO-8859-1 is what people use these days". On the interwebs, I see CP1252 mislabelled as ISO-8859-1 fairly frequently. Not sure which one you'd say they were "using" in that case, but it pretty much doesn't matter what "most people" are using, what matters is the minority of people whose text breaks your code ;-)

Question 13

That’s not what OP wants. Why would he want to convert UTF-16 losslessly to single-byte codepoints? The question doesn’t imply this anywhere. Mention of ISO-8859-1 is just misguided. "in most cases [it’s] what people use these days" is completely wrong. In fact, modern browsers actually use a different encoding even if you explicitly request this encoding because almost nobody ever means ISO-8859-1, even if they say so.

Question 14

Or how about converting to utf-16 or utf-32 for internal processing.
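
If you go the UTF-16 route, keep in mind that code points outside the BMP need two 16-bit units (a surrogate pair). A minimal sketch of that step, assuming a C++11 compiler with `char16_t`/`char32_t` (the helper name `append_utf16` is hypothetical, and the input is assumed to be a valid code point that has already been decoded):

```cpp
#include <string>

// Hypothetical helper: appends one code point to `out` as UTF-16.
// Assumes `cp` is a valid Unicode scalar value (not a surrogate, <= U+10FFFF).
void append_utf16(char32_t cp, std::u16string& out) {
    if (cp < 0x10000) {
        // BMP: a single 16-bit unit, identical to UCS-2.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Outside the BMP: split the 20 remaining bits into a surrogate pair.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
}
```

So even after converting to UTF-16, the number of 16-bit units is not always the number of code points; only UTF-32 gives one unit per code point.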

Question 15

First of all, I did not say UTF-16. On Windows they use UCS-2. They don't know what UTF-16 is. Second, the first plane of Unicode is ISO-8859-1, whatever you say, that's what it is. Third, CP1252 is specific to Windows and if you convert from UTF-8 you're not going to get CP1252 which is why I mention that you get ISO-8859-1. Then it's your problem to properly select the correct font to render the text later. If you know what encoding you have, you can do it.

Question 16

Yes, the first 256 characters of UCS-2 are the same as UCS-4, UTF-16 and UTF-8 once converted. They're all ISO-8859-1. Converting to another encoding (such as CP1252) requires tables or a library such as iconv (which I recommend you avoid!)
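
Because the first 256 code points coincide with ISO-8859-1, converting from ISO-8859-1 to UTF-8 needs no table at all. A minimal sketch (the function name `latin1_to_utf8` is made up for this example; it does not apply to CP1252, which differs in the 0x80..0x9F range):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: converts an ISO-8859-1 byte string to UTF-8.
// Each input byte is the code point itself (U+0000..U+00FF).
std::string latin1_to_utf8(const std::string& latin1) {
    std::string out;
    for (std::size_t i = 0; i < latin1.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(latin1[i]);
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));                 // ASCII stays as-is
        } else {
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));   // lead byte 110000xx
            out.push_back(static_cast<char>(0x80 | (b & 0x3F))); // continuation 10xxxxxx
        }
    }
    return out;
}
```

For example, the ISO-8859-1 bytes `"ol\xE9"` become the UTF-8 bytes `"ol\xC3\xA9"`.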

Accepted Answer · mmmmmm · 2012-06-25 10:16:04Z

A commonly used library is ICU - International Components for Unicode
