Post Timeline

added 175 characters in body

edited May 19, 2018 at 3:33

Both std::string and std::wstring must use UTF encoding to represent Unicode. On macOS specifically, std::string is UTF-8 (8-bit code units), and std::wstring is UTF-32 (32-bit code units); note that the size of wchar_t is platform-dependent.

For both, size tracks the number of code units instead of the number of code points or grapheme clusters. (A code point is one named Unicode entity, one or more of which form a grapheme cluster. Grapheme clusters are the visible characters that users interact with, like letters or emojis.)
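To make the distinction concrete, here is a minimal sketch that counts code points in a std::string, assuming the string holds valid UTF-8. The helper name count_code_points_utf8 is made up for illustration and is not part of the standard library. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx, so counting every other byte counts code points. It does not count grapheme clusters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string that is
// assumed to contain valid UTF-8 (no validation is performed).
std::size_t count_code_points_utf8(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        // Continuation bytes look like 10xxxxxx; every other byte
        // starts a new code point.
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the UTF-8 string "\xE4\xBD\xA0\xE5\xA5\xBD" (你好), size() is 6 while the helper returns 2. For "e\xCC\x81" (an e followed by a combining acute accent), size() is 3 and the helper returns 2, even though a user sees a single character.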

Although I'm not familiar with the Unicode representation of Chinese, it's very possible that when you use UTF-32, the number of code units is often very close to the number of grapheme clusters. Obviously, however, this comes at the cost of using up to 4x more memory.
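As a rough illustration of that trade-off, the sketch below compares storage between encodings. It assumes the UTF-32 text lives in a std::u32string (which is always UTF-32, unlike std::wstring, whose code unit size depends on the platform). The helper storage_bytes is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: bytes occupied by the code units of a string.
// Ignores allocator overhead and the terminating null.
template <typename CharT>
std::size_t storage_bytes(const std::basic_string<CharT>& s) {
    return s.size() * sizeof(CharT);
}
```

For 你好, the UTF-32 string U"\u4F60\u597D" has size() 2 and takes 8 bytes, while the UTF-8 form takes 6 bytes. For ASCII text such as "hi" the gap is the full 4x: 8 bytes against 2. Note that U"e\u0301" still has size() 2 even though it is one grapheme cluster, so UTF-32 only brings size closer to the visible character count, it does not guarantee a match.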

The most accurate solution would be to use a Unicode library, such as ICU, to calculate the Unicode properties that you are after.

Finally, UTF strings in human languages that don't use combining characters usually do pretty well with find/regex. I'm not sure about Chinese, but English is one of them.
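One reason find tends to behave on UTF-8 is that the encoding is self-synchronizing: a valid UTF-8 needle can only match a valid UTF-8 haystack at a code point boundary. The position that find returns is still measured in code units, though. The sketch below, built around a hypothetical helper named find_code_point_index, converts that position to a code point index, again assuming valid UTF-8 input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: code point index of the first occurrence of
// needle in haystack, or std::string::npos if there is no match.
// Both strings are assumed to contain valid UTF-8.
std::size_t find_code_point_index(const std::string& haystack,
                                  const std::string& needle) {
    const std::size_t byte_pos = haystack.find(needle);
    if (byte_pos == std::string::npos) {
        return std::string::npos;
    }
    // Count the code points that start before the match.
    std::size_t index = 0;
    for (std::size_t i = 0; i < byte_pos; ++i) {
        const unsigned char c = static_cast<unsigned char>(haystack[i]);
        if ((c & 0xC0) != 0x80) {
            ++index;
        }
    }
    return index;
}
```

Searching 你好世界 for 世界 gives a find result of 6 (a byte offset), while the helper returns 2. This does nothing about combining characters or normalization, which is where a library such as ICU comes in.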

added 7 characters in body

edited May 18, 2018 at 18:09

zneak

For both, size tracks the number of code units instead of the number of logical characters. (Logical characters are one or more code points.)

Although I'm not familiar with the Unicode representation of Chinese, it's very possible that when you use UTF-32, the number of code units is often very close to the number of logical characters. Obviously, however, this comes at the cost of using up to 4x more memory.

The most accurate solution would be to use a Unicode library, such as ICU, to calculate the Unicode properties that you are after.

Finally, UTF strings in human languages that don't use combining characters usually do pretty well with find/regex. I'm not sure about Chinese, but English is one of them.

deleted 3 characters in body

edited May 18, 2018 at 5:15

zneak

Both std::string and std::wstring use UTF encoding. On macOS specifically, std::string is UTF-8 (8-bit code units), and std::wstring is UTF-32 (32-bit code units); note that the size of wchar_t is platform-dependent.

For both, size tracks the number of code units instead of the number of logical characters. (Logical characters are one or more code points.)

Although I'm not familiar with the Unicode representation of Chinese, it's very possible that when you use UTF-32, the number of code units is often very close to the number of logical characters. Obviously, however, this comes at the cost of using up to 4x more memory.

The most accurate solution would be to use a Unicode library, such as ICU, to calculate the Unicode properties that you are after.

Finally, UTF strings in human languages that don't use combining characters usually do pretty well with regex. I'm not sure about Chinese, but English is one of them.

added 196 characters in body

edited May 18, 2018 at 3:48

zneak

answered May 18, 2018 at 3:41

zneak
