I have been working with Unicode in C++ for several days now, and it is still very unclear to me. I have a few questions about its usage and would be happy if they could be answered. The goal is simply to output strings with the proper Unicode characters.
As far as I understand, � is printed when a character is broken, for example when you try to cast a wchar_t to a char.
About my machine OS: kubuntu 19.10
g++ --version
g++ (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
1. Why does this work, given that std::string should only be capable of storing chars, which "é" is not?
setlocale(LC_ALL, "en_US.utf8");
std::cout << "é" << std::endl;
output: é
2. Printing a wchar_t is very strange. Why do the following snippets produce the output they do?
setlocale(LC_ALL, "en_US.utf8");
wchar_t a = L'é';
std::cout << a << std::endl;
output: 233
setlocale(LC_ALL, "en_US.utf8");
wchar_t a = L'é';
std::wcout << a << std::endl;
output: �
setlocale(LC_ALL, "en_US.utf8");
wchar_t a = L'é';
printf("%lc\n", a);
output: é
setlocale(LC_ALL, "en_US.utf8");
wchar_t a = L'é';
wprintf(L"%lc\n", a);
output: é
PS: setlocale(LC_ALL, "en_US.utf8") is there as suggested by this source. Otherwise, std::wcout would print question marks instead of the proper chars.
1 Answer
1. g++ uses UTF-8 as its default execution charset (you can change it with -fexec-charset=), which means that the "é" in your first example is encoded in UTF-8.
2.a There is no operator<< taking an ostream and a wchar_t. That means that the latter is promoted and displayed as a number (wchar_t, like char, is an integral type).
The others work as expected; I don't think more explanation is needed. One thing to be aware of, though, is that your environment has to be configured correctly. That's why I asked you to pipe the output into | od -t x1 to check that the output was the expected one. As it is, the issue is a display issue, and if you still have it, you'll have to check the configuration of your terminal emulator.
3 Comments
const char arr[] = "россиянᔙинΩaऋ"; std::cout << arr << std::endl; Could you explain to me why the code above works and prints correctly? This cannot be due to extended ASCII, as mentioned in one comment to the question, because these characters do not belong to that set. The character 'ऋ', for example, takes 3 bytes in UTF-8. How can that be stored in a char, which has a size of 1 byte? I printed strlen(arr), which gives 27. Is it possible that a character like 'ऋ' simply gets stored in three chars?
const char arr[] = "россиянᔙинΩaऋ"; std::cout << arr << std::endl; works fine if 1) your source file is saved as UTF-8, 2) your compiler is set to interpret the source code as UTF-8 and outputs the string literal data in the final executable as UTF-8, and 3) your console supports the display of UTF-8 output from your executable. However, you should not rely on all of this being true. If you are using an up-to-date compiler (or at least a C++11 one), consider using the u8 or L prefix on string literals to handle Unicode strings; don't rely on the compiler's charset settings.
"é" is not a std::string but a string literal, which is of type const char[N].