Converting a char string to wchar_t string based on a given toWideStr() starting point

Question 1

I'm working on a legacy code base and I came across a method in which I wanted to remove the chance of swallowing an exception. In the following I want to walk you through the refacoring process, because a code review of only the resulting code might not reveal problems in my thought process. If you want to review the final code only, you'll find it at the end.

For readability purposes,

I have merged the header and the implementation into one file, so that I don't need to post two sections of code.
I am only showing one method of the class instead of all methods. I am only refactoring this one method

The original code may have been written a long time ago, even not in C++, but it may have been written in Pascal, then ported to C and later compiled as C++. The code was last compiled with C++14 and "it worked", whatever that means :-). We now upgraded the toolset to C++20, so I can use C++20 features. But I may not be familiar with all C++20 features yet.

Goals of the review:

May I have introduced subtle bugs which I am not aware of?
Is the code "clean", i.e. readable?
Did I follow best practices and guidelines?

Non-goals:

Renaming of the class and of the method. If I change this, I have to change too many places in the rest of the application.
Making the static code a free function.
using <codecvt> because it might be removed in C++26 and I'm not sure whether it gives identical results.

Original code

#include <string>
#include "Windows.h"
class CDataConvert
{
public:
 static std::wstring toWideStr(const std::string& st);
};
std::wstring CDataConvert::toWideStr(const std::string& st)
{
 std::wstring wstr;
 const int iLen = int(st.length()) + 1;
 wchar_t *pwsz;
 if (iLen > 1)
 {
 try {
 pwsz = new wchar_t[iLen];
 }
 catch (std::bad_alloc a) {
 pwsz = nullptr;
 }
 if (pwsz)
 {
 MultiByteToWideChar(CP_ACP, 0, st.c_str(), iLen, pwsz, iLen);
 wstr = pwsz;
 delete[] pwsz;
 }
 }
 return wstr;
}

Fixing the swallowed exception

When I first saw the code, I wanted to get rid of the try/catch using new (std::nothrow), which would be a 1:1 replacement, IMHO. However, I thought about it. This problem should only happen in out-of-memory situations, in which case it doesn't make sense to let the application continue to the next method, where it will just run out of memory as well.

Changes to the code:

removed the exception handling
removed if(pwsz), because execution will not reach this line if new throws

#include <string>
#include "Windows.h"
class CDataConvert
{
public:
 static std::wstring toWideStr(const std::string& st);
};
std::wstring CDataConvert::toWideStr(const std::string& st)
{
 std::wstring wstr;
 const int iLen = int(st.length()) + 1;
 wchar_t *pwsz;
 if (iLen > 1)
 {
 pwsz = new wchar_t[iLen];
 MultiByteToWideChar(CP_ACP, 0, st.c_str(), iLen, pwsz, iLen);
 wstr = pwsz;
 delete[] pwsz;
 }
 return wstr;
}

First round of adapations

Unlike perhaps embedded code, I don't think we need single exit points in code that runs on PCs. We can easily put many breakpoints. With that, I can handle the special case for iLen <= 1 immediately and reduce the overall indentation.

Next, I want to initialize all variables, so they have defined values and it makes it harder to run into UB. I use uniform initialization, i.e. {}.

Since the return value is expected to be used, I put [[nodiscard]] on the method.

In Pascal, AFAIK, variables had to be declared in the beginning, and it was considered good practice in early versions of C or C++ due to bugs in compilers handling the stack, but those times should be over, so I move declarations closer to their usage. This even allows joining the declaration and initialization of pwsz.

#include <string>
#include "Windows.h"
class CDataConvert
{
public:
 static std::wstring toWideStr(const std::string& st);
};
[[nodiscard]] std::wstring CDataConvert::toWideStr(const std::string& st)
{
 std::wstring wstr{};
 const int iLen = int(st.length()) + 1;
 if (iLen <= 1) return wstr;
 wchar_t *pwsz = new wchar_t[iLen];
 MultiByteToWideChar(CP_ACP, 0, st.c_str(), iLen, pwsz, iLen);
 wstr = pwsz;
 delete[] pwsz;
 return wstr;
}

Second round of changes

wstr = pwsz should be the copy assignment operator. It can't be a pointer assignment, because the pointer is deleted later on. Since we have data() on strings, I thought I could simply make wstr long enough and then let MultiByteToWideChar copy the characters directly into wstr instead of using an intermediate buffer.

In this step I will also remove Hungarian notation. With more and more templates and the auto keyword, I think Hungarian notation plays a less and less important role.

#include <string>
#include "Windows.h"
class CDataConvert
{
public:
 static std::wstring toWideStr(const std::string& s);
};
[[nodiscard]] std::wstring CDataConvert::toWideStr(const std::string& s)
{
 std::wstring result{};
 const int len = int(s.length()) + 1;
 if (len <= 1) return result;
 result.resize(len);
 MultiByteToWideChar(CP_ACP, 0, s.c_str(), len, result.data(), len);
 return result;
}

Third round of changes

Since C++11, both, c_str() and data() return a null terminated string. The +1 in len was required in order to copy the terminating 0円-character. This was needed for the copy assignment operator to find the end of the string. In the changed version, resize() will already put a 0円 at the end and thus 0円 does not need to be copied.

So we can get rid of the +1, change the if-statement and inline len.

#include <string>
#include "Windows.h"
class CDataConvert
{
public:
 static std::wstring toWideStr(const std::string& s);
};
[[nodiscard]] std::wstring CDataConvert::toWideStr(const std::string& s)
{
 std::wstring result{};
 if (s.empty()) return result;
 result.resize(s.length());
 MultiByteToWideChar(CP_ACP, 0, s.c_str(), s.length(), result.data(), s.length());
 return result;
}

Final code

Since the above code gave me warning C4267, I re-introduced len as a variable.

#include <string>
#include "Windows.h"
class CDataConvert
{
public:
 static std::wstring toWideStr(const std::string& s);
};
[[nodiscard]] std::wstring CDataConvert::toWideStr(const std::string& s)
{
 std::wstring result{};
 if (s.empty()) return result;
 
 result.resize(s.length());
 const int len = static_cast<int>(s.length());
 MultiByteToWideChar(CP_ACP, 0, s.c_str(),len, result.data(), len);
 return result;
}

Question 2

Do you really need MultiByteToWideChar() (I'm guessing that's from "Windows.h")? Or can you use std::mbstowcs() or std::mbsrtowcs() so that your code is portable? I don't know the spec of your platform-specific function, but the name and usage suggests it's similar.

Question 3

@TobySpeight One complication is that a Windows "wide-character string" is a multi-byte encoding itself, UTF-16, even though this violates the standard. I’m not sure if mbstowcs necessarily handles that.

Question 4

Historical note: "In Pascal, AFAIK, variables had to be declared in the beginning, and it was considered good practice in early versions of C or C++ due to bugs in compilers handling the stack[,]" This was not because of bugs, but to make it possible to write a single-pass compiler for programs that would be fed into the computer on a deck of punch cards without storing the entire program in memory.

Question 5

This was a huge selling point for the Burroughs Algol compiler that added this restriction first. On its computers, running a second pass would require the operator to remove the punch cards from the output chute, reload them into the feeder, and run the program again immediately That was inconvenient enough already if you could do it yourself.. But many companies or schools would just have the secretary punch the program onto cards and run everyone’s decks of punch cards in order, so programmers often had to wait for the next day to get their punch cards run through a second time.

Question 6

That hardware was obsolete by the early ’70s, but still in use, especially in Europe. Edsgar Dijkstra was stuck using them and considered single-pass compilation a godsend.

Question 7

Consider using `std::string_view` for the input parameter

You are passing the input string as a const std::string&. However, this can be inefficient, for example it will cause a copy if the input is not actually already a std::string, for example when passing a C-string. Since C++17 you can use std::string_view instead, which will efficiently pass both C strings, std::strings and possibly other forms of strings.

Note however that there is no guarantee that a std::string_view has a NUL-terminator. This is not a problem for MultiByteToWideChar() though, as you pass it the length of the input string anyway.

The output might have a different number of characters than the input

Consider that a single UTF-8 character might be multiple chars big. That means that less wchar_ts are written to the output than there were chars in the input. However, you never check the return value of MultiByteToWideChar(), which should be the actual size of the result. Make sure you do, and resize the output before returning it.

Don't blindly cast a size to `int`

Sure, normally strings won't be so large that their length does not fit into an int, but if you don't handle this case correctly, you might leave a potential security hole in your program. Either throw an exception if the input is too large for a single call to MultiByteToWideChar(), or better, call MultiByteToWideChar() multiple times if necessary.

Consider using `std::codecvt` or `std::mbstowcs()`

Non-goals: [...]

using <codecvt> because it might be removed in C++26 and I'm not sure whether it gives identical results.

Indeed, the header file <codecvt> has been deprecated, but std::codecvt from <locale> has not. It should give identical results, and should be more portable, but you would have to test that.

However, using std::codecvt directly doesn't look very nice. But you want to convert the input specifically to a std::wstring, and that can also be done using std::mbstowcs(). Note that this function does expect the input to be NUL-terminated; it has no parameter for the size of the input string. The code could then look like:

[[nodiscard]] std::wstring CDataConvert::toWideStr(const std::string& s)
{
 std::wstring result{};
 result.resize(s.size());
 auto new_size = std::mbstowcs(result.data(), s.data(), s.size());
 if (new_size == static_cast<std::size_t>(-1)) {
 throw std::runtime_error("Conversion error");
 }
 result.resize(new_size);
 return result;
}

Question 8

The empty() case is to avoid the special case of "Note that, if cbMultiByte is 0, the function fails." [source]

Question 9

Thanks for mentioning a lot of stuff I didn't know yet.

Question 10

Oof, I did not know it didn't handle empty strings. How annoying!

Question 11

The wide-character string could, in theory, have more elements than the input string. 1. If the current ANSI code page maps an input character to a UTF-16 surrogate pair (not sure if that’s realistically possible), 2: If an character is mapped to a decomposed grapheme (although the original too the default option to always use precomposed characters, and I believe every Windows code page has precomposed mappings).

G. Sliepen G. Sliepen 68.7k3 gold badges74 silver badges179 bronze badges · Answer 1 · 2024-10-20 21:15:34Z

Consider using `std::string_view` for the input parameter

You are passing the input string as a const std::string&. However, this can be inefficient, for example it will cause a copy if the input is not actually already a std::string, for example when passing a C-string. Since C++17 you can use std::string_view instead, which will efficiently pass both C strings, std::strings and possibly other forms of strings.

Note however that there is no guarantee that a std::string_view has a NUL-terminator. This is not a problem for MultiByteToWideChar() though, as you pass it the length of the input string anyway.

The output might have a different number of characters than the input

Consider that a single UTF-8 character might be multiple chars big. That means that less wchar_ts are written to the output than there were chars in the input. However, you never check the return value of MultiByteToWideChar(), which should be the actual size of the result. Make sure you do, and resize the output before returning it.

Don't blindly cast a size to `int`

Sure, normally strings won't be so large that their length does not fit into an int, but if you don't handle this case correctly, you might leave a potential security hole in your program. Either throw an exception if the input is too large for a single call to MultiByteToWideChar(), or better, call MultiByteToWideChar() multiple times if necessary.

Consider using `std::codecvt` or `std::mbstowcs()`

Non-goals: [...]

using <codecvt> because it might be removed in C++26 and I'm not sure whether it gives identical results.

Indeed, the header file <codecvt> has been deprecated, but std::codecvt from <locale> has not. It should give identical results, and should be more portable, but you would have to test that.

However, using std::codecvt directly doesn't look very nice. But you want to convert the input specifically to a std::wstring, and that can also be done using std::mbstowcs(). Note that this function does expect the input to be NUL-terminated; it has no parameter for the size of the input string. The code could then look like:

[[nodiscard]] std::wstring CDataConvert::toWideStr(const std::string& s)
{
 std::wstring result{};
 result.resize(s.size());
 auto new_size = std::mbstowcs(result.data(), s.data(), s.size());
 if (new_size == static_cast<std::size_t>(-1)) {
 throw std::runtime_error("Conversion error");
 }
 result.resize(new_size);
 return result;
}

The empty() case is to avoid the special case of "Note that, if cbMultiByte is 0, the function fails." [source]
Oof, I did not know it didn't handle empty strings. How annoying!
The wide-character string could, in theory, have more elements than the input string. 1. If the current ANSI code page maps an input character to a UTF-16 surrogate pair (not sure if that’s realistically possible), 2: If an character is mapped to a decomposed grapheme (although the original too the default option to always use precomposed characters, and I believe every Windows code page has precomposed mappings).

Stack Exchange Network

Converting a char string to wchar_t string based on a given toWideStr() starting point

Original code

Fixing the swallowed exception

First round of adapations

Second round of changes

Third round of changes

Final code

1 Answer 1

Consider using `std::string_view` for the input parameter

The output might have a different number of characters than the input

Don't blindly cast a size to `int`

Consider using `std::codecvt` or `std::mbstowcs()`

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Converting a char string to wchar_t string based on a given toWideStr() starting point

Original code

Fixing the swallowed exception

First round of adapations

Second round of changes

Third round of changes

Final code

1 Answer 1

Consider using std::string_view for the input parameter

The output might have a different number of characters than the input

Don't blindly cast a size to int

Consider using std::codecvt or std::mbstowcs()

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

Consider using `std::string_view` for the input parameter

Don't blindly cast a size to `int`

Consider using `std::codecvt` or `std::mbstowcs()`