C++ - Why isn't the unicode output correct?

Question 1

I have been working with Unicode in C++ for several days now and it is still very unclear to me. I have a few questions about its usage and would be happy if they could be answered. The goal is simply to output the string with the proper Unicode characters.

As far as I understand, � is printed when a character is broken, for example when you try to cast a wchar_t to a char.

My machine's OS: Kubuntu 19.10

g++ --version
g++ (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

1. Why does this work, given that std::string should only be capable of storing chars, which "é" is not?

setlocale(LC_ALL, "en_US.utf8");
std::cout << "é" << std::endl;
output: é

2. Printing a wchar_t is very strange. Why is the output of each of the following snippets what it is?

setlocale(LC_ALL, "en_US.utf8");
wchar_t a = L'é';
std::cout << a << std::endl;
output: 233

setlocale(LC_ALL, "en_US.utf8");
wchar_t a = L'é';
std::wcout << a << std::endl;
output: �

setlocale(LC_ALL, "en_US.utf8");
wchar_t a = L'é';
printf("%lc\n", a);
output: é

setlocale(LC_ALL, "en_US.utf8");
wchar_t a = L'é';
wprintf(L"%lc\n", a);
output: é

PS: setlocale(LC_ALL, "en_US.utf8") is there as suggested by this source. Otherwise, std::wcout would print question marks instead of the proper chars.

Comments

Extended ascii?

Also note that "é" is not a std::string but a string literal, which is of type const char[N]

@NathanOliver-ReinstateMonica I never heard of extended ascii so thanks! Nevertheless, this only explains my first question if I understand correctly.

On what system are you?

Q: "C++ - Why isn't the unicode output correct?" A: "Because you used C++ and unicode in the same sentence" 😞

AProgrammer · Accepted Answer · 2019-12-09 20:51:03Z

2

1. g++ uses UTF-8 as its default execution charset. You can change that with -fexec-charset=, but by default it means that the "é" in your first example is encoded in UTF-8.
2.a There is no operator<< taking an ostream and a wchar_t. That means that the latter is promoted and displayed as a number (wchar_t, like char, is an integral type).

The others work as expected; I don't think more explanation is needed. One thing to be aware of, though, is that your environment needs to be configured correctly. That's why I asked you to pipe the output through | od -t x1, to check that the bytes written were the expected ones. As it stands, the issue is a display issue, and if you still have it, you will have to check the configuration of your terminal emulator.

3 Comments

Spixmaster · 2019-12-09T21:19:21.83Z

const char arr[] = "россиянᔙинΩaऋ"; std::cout << arr << std::endl; Could you explain why the code above works and prints correctly? This cannot be due to extended ASCII, as mentioned in one comment on the question, because these characters do not belong to that set. The character 'ऋ', for example, takes 3 bytes in UTF-8. How can it be stored in a char, which has a size of 1 byte? I printed strlen(arr), which gives 27. Is it possible that a character like 'ऋ' is simply stored in three chars?

AProgrammer · 2019-12-09T21:28:28.713Z

@Spixmaster, yes, that's what is happening. GCC uses UTF-8 as the encoding for narrow string literals.

Remy Lebeau · 2019-12-09T22:04:33.417Z

@Spixmaster const char arr[] = "россиянᔙинΩaऋ"; std::cout << arr << std::endl; works fine if 1) your source file is saved as UTF-8, 2) your compiler is set to interpret the source code as UTF-8 and outputs the string literal data in the final executable as UTF-8, and 3) your console supports the display of UTF-8 output from your executable. However, you should not rely on all of this being true. If you are using an up-to-date compiler (or at least a C++11 one), consider using the u8 or L prefix on string literals to handle Unicode strings, rather than relying on the compiler's charset settings.
