I'm working on a legacy code base and I came across a method in which I wanted to remove the chance of swallowing an exception. In the following I want to walk you through the refacoring process, because a code review of only the resulting code might not reveal problems in my thought process. If you want to review the final code only, you'll find it at the end.
For readability purposes,
- I have merged the header and the implementation into one file, so that I don't need to post two sections of code.
- I am only showing one method of the class instead of all methods. I am only refactoring this one method
The original code may have been written a long time ago, even not in C++, but it may have been written in Pascal, then ported to C and later compiled as C++. The code was last compiled with C++14 and "it worked", whatever that means :-). We now upgraded the toolset to C++20, so I can use C++20 features. But I may not be familiar with all C++20 features yet.
Goals of the review:
- May I have introduced subtle bugs which I am not aware of?
- Is the code "clean", i.e. readable?
- Did I follow best practices and guidelines?
Non-goals:
- Renaming of the class and of the method. If I change this, I have to change too many places in the rest of the application.
- Making the static code a free function.
- using
<codecvt>
because it might be removed in C++26 and I'm not sure whether it gives identical results.
Original code
#include <string>
#include "Windows.h"
class CDataConvert
{
public:
static std::wstring toWideStr(const std::string& st);
};
std::wstring CDataConvert::toWideStr(const std::string& st)
{
std::wstring wstr;
const int iLen = int(st.length()) + 1;
wchar_t *pwsz;
if (iLen > 1)
{
try {
pwsz = new wchar_t[iLen];
}
catch (std::bad_alloc a) {
pwsz = nullptr;
}
if (pwsz)
{
MultiByteToWideChar(CP_ACP, 0, st.c_str(), iLen, pwsz, iLen);
wstr = pwsz;
delete[] pwsz;
}
}
return wstr;
}
Fixing the swallowed exception
When I first saw the code, I wanted to get rid of the try/catch
using new (std::nothrow)
, which would be a 1:1 replacement, IMHO. However, I thought about it. This problem should only happen in out-of-memory situations, in which case it doesn't make sense to let the application continue to the next method, where it will just run out of memory as well.
Changes to the code:
- removed the exception handling
- removed
if(pwsz)
, because execution will not reach this line ifnew
throws
#include <string>
#include "Windows.h"
class CDataConvert
{
public:
static std::wstring toWideStr(const std::string& st);
};
std::wstring CDataConvert::toWideStr(const std::string& st)
{
std::wstring wstr;
const int iLen = int(st.length()) + 1;
wchar_t *pwsz;
if (iLen > 1)
{
pwsz = new wchar_t[iLen];
MultiByteToWideChar(CP_ACP, 0, st.c_str(), iLen, pwsz, iLen);
wstr = pwsz;
delete[] pwsz;
}
return wstr;
}
First round of adapations
Unlike perhaps embedded code, I don't think we need single exit points in code that runs on PCs. We can easily put many breakpoints. With that, I can handle the special case for iLen <= 1
immediately and reduce the overall indentation.
Next, I want to initialize all variables, so they have defined values and it makes it harder to run into UB. I use uniform initialization, i.e. {}
.
Since the return value is expected to be used, I put [[nodiscard]]
on the method.
In Pascal, AFAIK, variables had to be declared in the beginning, and it was considered good practice in early versions of C or C++ due to bugs in compilers handling the stack, but those times should be over, so I move declarations closer to their usage. This even allows joining the declaration and initialization of pwsz
.
#include <string>
#include "Windows.h"
class CDataConvert
{
public:
static std::wstring toWideStr(const std::string& st);
};
[[nodiscard]] std::wstring CDataConvert::toWideStr(const std::string& st)
{
std::wstring wstr{};
const int iLen = int(st.length()) + 1;
if (iLen <= 1) return wstr;
wchar_t *pwsz = new wchar_t[iLen];
MultiByteToWideChar(CP_ACP, 0, st.c_str(), iLen, pwsz, iLen);
wstr = pwsz;
delete[] pwsz;
return wstr;
}
Second round of changes
wstr = pwsz
should be the copy assignment operator. It can't be a pointer assignment, because the pointer is deleted later on. Since we have data()
on strings, I thought I could simply make wstr
long enough and then let MultiByteToWideChar
copy the characters directly into wstr
instead of using an intermediate buffer.
In this step I will also remove Hungarian notation. With more and more templates and the auto
keyword, I think Hungarian notation plays a less and less important role.
#include <string>
#include "Windows.h"
class CDataConvert
{
public:
static std::wstring toWideStr(const std::string& s);
};
[[nodiscard]] std::wstring CDataConvert::toWideStr(const std::string& s)
{
std::wstring result{};
const int len = int(s.length()) + 1;
if (len <= 1) return result;
result.resize(len);
MultiByteToWideChar(CP_ACP, 0, s.c_str(), len, result.data(), len);
return result;
}
Third round of changes
Since C++11, both, c_str()
and data()
return a null terminated string. The +1
in len
was required in order to copy the terminating 0円
-character. This was needed for the copy assignment operator to find the end of the string. In the changed version, resize()
will already put a 0円
at the end and thus 0円
does not need to be copied.
So we can get rid of the +1
, change the if
-statement and inline len
.
#include <string>
#include "Windows.h"
class CDataConvert
{
public:
static std::wstring toWideStr(const std::string& s);
};
[[nodiscard]] std::wstring CDataConvert::toWideStr(const std::string& s)
{
std::wstring result{};
if (s.empty()) return result;
result.resize(s.length());
MultiByteToWideChar(CP_ACP, 0, s.c_str(), s.length(), result.data(), s.length());
return result;
}
Final code
Since the above code gave me warning C4267, I re-introduced len
as a variable.
#include <string>
#include "Windows.h"
class CDataConvert
{
public:
static std::wstring toWideStr(const std::string& s);
};
[[nodiscard]] std::wstring CDataConvert::toWideStr(const std::string& s)
{
std::wstring result{};
if (s.empty()) return result;
result.resize(s.length());
const int len = static_cast<int>(s.length());
MultiByteToWideChar(CP_ACP, 0, s.c_str(),len, result.data(), len);
return result;
}
1 Answer 1
Consider using std::string_view
for the input parameter
You are passing the input string as a const std::string&
. However, this can be inefficient, for example it will cause a copy if the input is not actually already a std::string
, for example when passing a C-string. Since C++17 you can use std::string_view
instead, which will efficiently pass both C strings, std::string
s and possibly other forms of strings.
Note however that there is no guarantee that a std::string_view
has a NUL-terminator. This is not a problem for MultiByteToWideChar()
though, as you pass it the length of the input string anyway.
The output might have a different number of characters than the input
Consider that a single UTF-8 character might be multiple char
s big. That means that less wchar_t
s are written to the output than there were char
s in the input. However, you never check the return value of MultiByteToWideChar()
, which should be the actual size of the result. Make sure you do, and resize the output before returning it.
Don't blindly cast a size to int
Sure, normally strings won't be so large that their length does not fit into an int
, but if you don't handle this case correctly, you might leave a potential security hole in your program. Either throw
an exception if the input is too large for a single call to MultiByteToWideChar()
, or better, call MultiByteToWideChar()
multiple times if necessary.
Consider using std::codecvt
or std::mbstowcs()
Non-goals: [...]
- using
<codecvt>
because it might be removed in C++26 and I'm not sure whether it gives identical results.
Indeed, the header file <codecvt>
has been deprecated, but std::codecvt
from <locale>
has not. It should give identical results, and should be more portable, but you would have to test that.
However, using std::codecvt
directly doesn't look very nice. But you want to convert the input specifically to a std::wstring
, and that can also be done using std::mbstowcs()
. Note that this function does expect the input to be NUL-terminated; it has no parameter for the size of the input string. The code could then look like:
[[nodiscard]] std::wstring CDataConvert::toWideStr(const std::string& s)
{
std::wstring result{};
result.resize(s.size());
auto new_size = std::mbstowcs(result.data(), s.data(), s.size());
if (new_size == static_cast<std::size_t>(-1)) {
throw std::runtime_error("Conversion error");
}
result.resize(new_size);
return result;
}
-
1\$\begingroup\$ The
empty()
case is to avoid the special case of "Note that, if cbMultiByte is 0, the function fails." [source] \$\endgroup\$Thomas Weller– Thomas Weller2024年10月20日 21:18:20 +00:00Commented Oct 20, 2024 at 21:18 -
\$\begingroup\$ Thanks for mentioning a lot of stuff I didn't know yet. \$\endgroup\$Thomas Weller– Thomas Weller2024年10月20日 21:21:37 +00:00Commented Oct 20, 2024 at 21:21
-
\$\begingroup\$ Oof, I did not know it didn't handle empty strings. How annoying! \$\endgroup\$G. Sliepen– G. Sliepen2024年10月20日 21:22:42 +00:00Commented Oct 20, 2024 at 21:22
-
1\$\begingroup\$ The wide-character string could, in theory, have more elements than the input string. 1. If the current ANSI code page maps an input character to a UTF-16 surrogate pair (not sure if that’s realistically possible), 2: If an character is mapped to a decomposed grapheme (although the original too the default option to always use precomposed characters, and I believe every Windows code page has precomposed mappings). \$\endgroup\$Davislor– Davislor2024年10月21日 07:34:02 +00:00Commented Oct 21, 2024 at 7:34
MultiByteToWideChar()
(I'm guessing that's from"Windows.h"
)? Or can you usestd::mbstowcs()
orstd::mbsrtowcs()
so that your code is portable? I don't know the spec of your platform-specific function, but the name and usage suggests it's similar. \$\endgroup\$mbstowcs
necessarily handles that. \$\endgroup\$