While researching ways to convert back and forth between std::wstring and std::string, I found this conversation on the MSDN forums.
There were two functions that, to me, looked good. Specifically, these:
std::wstring s2ws(const std::string& s)
{
    int len;
    int slength = (int)s.length() + 1;
    len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, 0, 0);
    wchar_t* buf = new wchar_t[len];
    MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, buf, len);
    std::wstring r(buf);
    delete[] buf;
    return r;
}

std::string ws2s(const std::wstring& s)
{
    int len;
    int slength = (int)s.length() + 1;
    len = WideCharToMultiByte(CP_ACP, 0, s.c_str(), slength, 0, 0, 0, 0);
    char* buf = new char[len];
    WideCharToMultiByte(CP_ACP, 0, s.c_str(), slength, buf, len, 0, 0);
    std::string r(buf);
    delete[] buf;
    return r;
}
However, the double allocation and the need to delete the buffer concern me (performance and exception safety) so I modified them to be like this:
std::wstring s2ws(const std::string& s)
{
    int len;
    int slength = (int)s.length() + 1;
    len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, 0, 0);
    std::wstring r(len, L'\0');
    MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, &r[0], len);
    return r;
}

std::string ws2s(const std::wstring& s)
{
    int len;
    int slength = (int)s.length() + 1;
    len = WideCharToMultiByte(CP_ACP, 0, s.c_str(), slength, 0, 0, 0, 0);
    std::string r(len, '\0');
    WideCharToMultiByte(CP_ACP, 0, s.c_str(), slength, &r[0], len, 0, 0);
    return r;
}
Unit testing indicates that this works in a nice, controlled environment but will this be OK in the vicious and unpredictable world that is my client's computer?
9 Answers
I would redesign (and have redesigned) such a set of functions to resemble casts:
    std::wstring x;
    std::string y = string_cast<std::string>(x);
This can have a lot of benefits later when you start having to deal with some 3rd party library's idea of what strings should look like.
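A minimal sketch of how such a cast-style interface might be built, assuming a function template with one explicit specialization per supported pair of string types. Only the name string_cast comes from the snippet above; everything else is hypothetical and not the answerer's actual code. The conversions use std::mbstowcs and std::wcstombs, which follow the current C locale, purely as stand-ins for whatever conversion routine you prefer:

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>

// Primary template: declared but never defined, so an unsupported
// combination of types fails at link time instead of silently compiling.
template <typename To, typename From>
To string_cast(const From& from);

// Identity conversions, so generic code can call string_cast unconditionally.
template <>
inline std::string string_cast<std::string, std::string>(const std::string& from)
{
    return from;
}

template <>
inline std::wstring string_cast<std::wstring, std::wstring>(const std::wstring& from)
{
    return from;
}

// Narrow -> wide. Assumption: the input is encoded according to the current
// C locale and has no embedded NULs (std::mbstowcs stops at the first one).
template <>
inline std::wstring string_cast<std::wstring, std::string>(const std::string& from)
{
    // Every wide character consumes at least one byte, so from.size() suffices.
    std::wstring result(from.size(), L'\0');
    const std::size_t n = std::mbstowcs(&result[0], from.c_str(), result.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("string_cast: invalid multibyte sequence");
    result.resize(n);  // size() now excludes any terminator
    return result;
}

// Wide -> narrow, with the same assumptions as above.
template <>
inline std::string string_cast<std::string, std::wstring>(const std::wstring& from)
{
    // Every wide character produces at most MB_CUR_MAX bytes.
    std::string result(from.size() * MB_CUR_MAX, '\0');
    const std::size_t n = std::wcstombs(&result[0], from.c_str(), result.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("string_cast: character not representable");
    result.resize(n);
    return result;
}
```

With something like this in place, std::string y = string_cast<std::string>(x); compiles as written, and supporting a third-party string type is a matter of adding two more specializations.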
- I love the syntax. Can you share the code? (Jere.Jones, Jan 31, 2011 at 18:53)
- Oooh. That looks nice. How would one do that? Just make a template with specializations to convert between the various string types? (Billy ONeal, Feb 5, 2011 at 18:21)
- Why, though? I think I prefer to have functions to_string and to_wstring, similar to the standard library (in my own namespace of course). (Marc.2377, Sep 13, 2019 at 23:55)
Actually my unit testing shows that your code is wrong!
The problem is that you include the zero terminator in the output string, which is not supposed to happen with std::string and friends. Here's an example of why this can lead to problems, especially if you use std::string::compare:
#include <cstdio>
#include <cstring>
#include <string>

using std::string;

int main()
{
    // Allocate string with 5 characters (including the zero terminator as in your code!)
    string s(5, '_');
    memcpy(&s[0], "ABCD\0", 5);
    // Comparing with strcmp is all fine since it only compares until the terminator
    const int cmp1 = strcmp(s.c_str(), "ABCD"); // 0
    // ...however the number of characters that std::string::compare compares is
    // someString.size(), and since s.size() == 5, it is obviously not equal to "ABCD"!
    const int cmp2 = s.compare("ABCD"); // > 0, i.e. not equal
    // And just to prove that string implementations automatically add a zero terminator
    // if you call .c_str()
    s.resize(3);
    const int cmp3 = strcmp(s.c_str(), "ABC"); // 0
    const char term = s.c_str()[3]; // 0
    printf("cmp1=%d, cmp2=%d, cmp3=%d, terminator=%d\n", cmp1, cmp2, cmp3, (int)term);
    return 0;
}
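One possible fix, sketched here under the assumption that the converted string ends with exactly one unwanted terminator (as it does in the question's code, where the source length includes the + 1): drop that terminator before returning the string. The helper name below is hypothetical. Passing the source length without the + 1 should avoid producing the terminator in the first place, although an empty input may then need separate handling, since the Windows conversion functions are documented to fail when given a length of 0.

```cpp
#include <string>

// Hypothetical helper: removes a single trailing NUL character, if present,
// so that size() matches the visible text again.
template <typename String>
void drop_trailing_terminator(String& s)
{
    typedef typename String::value_type char_type;
    if (!s.empty() && s[s.size() - 1] == char_type())
        s.erase(s.size() - 1);  // erase from the last position to the end
}
```

Calling this on r just before return r; in either conversion function would make std::string::compare and string concatenation behave as expected again.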
- I found the addition of the terminator annoying too: it also broke a string addition in my case. I ended up adding the boolean parameter includeTerminator to both methods. (reallynice, Jul 30, 2015 at 13:01)
- @reallynice see: FlagArgument (martinfowler.com) (Marc.2377, Sep 12, 2019 at 0:53)
It really depends on which encodings are being used with std::wstring and std::string.
This answer assumes that the std::wstring is using a UTF-16 encoding, and that the conversion to std::string will use a UTF-8 encoding.
#include <codecvt>
#include <locale>
#include <string>

std::wstring utf8ToUtf16(const std::string& utf8Str)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.from_bytes(utf8Str);
}

std::string utf16ToUtf8(const std::wstring& utf16Str)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.to_bytes(utf16Str);
}
This answer uses only the standard library and does not rely on a platform-specific library.
- This is the best answer because it is the only one that works. Thank you. (Roman, Dec 24, 2017 at 9:02)
- dbj.org/c17-codecvt-deprecated-panic ... my shameless plug might help ... (DBJDBJ, Jul 15, 2019 at 15:48)
- @DBJDBJ Your proposed solution is by no means a replacement for wstring_convert from <codecvt>. You kind of downplay the problem by saying the approach is "somewhat controversial"; in my opinion, it's much more than that. It's wrong. I have uses of wstring_convert which your solution cannot replace. It cannot convert true Unicode strings like "おはよう", as it's not a true conversion; it's a cast. Consider making that more explicit in your text ;) (Marc.2377, Sep 12, 2019 at 0:59)
- @Marc.2377 -- well, I think I have placed plenty of warnings in that text. I have even mentioned the "casting" word. There is even a link to "that article" ... Many thanks for reading in any case. (DBJDBJ, Sep 12, 2019 at 15:17)
- FFWD to 2019 -- <codecvt> is deprecated (DBJDBJ, Sep 12, 2019 at 16:15)
One thing that may be an issue is that it assumes the string is ANSI formatted using the currently active code page (CP_ACP). You might want to consider using a specific code page or CP_UTF8 if it's UTF-8.
- This may be a silly question but, how can I tell? For my usage these will typically be filenames. (Jere.Jones, Jan 31, 2011 at 19:49)
- How do you obtain the filenames? That will determine the correct code page to use. (Ferruccio, Feb 1, 2011 at 1:30)
- @Jere.Jones: One way is to check if the string is valid UTF-8. If not, assume it's ANSI. (dan04, Feb 5, 2011 at 16:31)
- @dan04: ANSI requires that a code page is specified. en.wikipedia.org/wiki/Code_page. (Ferruccio, Feb 5, 2011 at 21:33)
- Further note: The MSDN documentation recommends not using CP_ACP for strings intended for permanent storage, because the active page may be changed at any time. (M.M, Feb 20, 2018 at 20:46)
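The heuristic suggested in the comments above (treat the bytes as UTF-8 if they validate, otherwise fall back to a legacy code page) needs a validity check. Here is a minimal sketch of one, assuming the usual definition of well-formed UTF-8: no overlong forms, no surrogates, and nothing above U+10FFFF. The function name is hypothetical, and the result is only a heuristic: pure ASCII always validates, and some legacy-encoded text can validate by accident.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if s is well-formed UTF-8.
bool is_valid_utf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;     // continuation bytes expected after the lead
        unsigned long cp = 0;      // code point being assembled
        unsigned long min_cp = 0;  // smallest code point allowed at this length

        if (lead < 0x80)                { extra = 0; cp = lead;        min_cp = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (n - i - 1 < extra)
            return false;  // sequence is cut off at the end of the string

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min_cp) return false;                  // overlong encoding
        if (cp > 0x10FFFF) return false;                // beyond the Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate

        i += extra + 1;
    }
    return true;
}
```

A caller could then pick CP_UTF8 when this returns true and a specific legacy code page otherwise, keeping in mind that knowing where the bytes came from is more reliable than guessing.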
I'd recommend changing this:
    int len;
    int slength = (int)s.length() + 1;
    len = WideCharToMultiByte(CP_ACP, 0, s.c_str(), slength, 0, 0, 0, 0);
...to this:
    int slength = (int)s.length() + 1;
    int len = WideCharToMultiByte(CP_ACP, 0, s.c_str(), slength, 0, 0, 0, 0);
Slightly more concise, len's scope is reduced, and you don't have an uninitialised variable floating round (ok, just for one line) as a trap for the unwary.
I don't do any Windows development, so I can't comment on the WideCharToMultiByte part being safe.
The one thing I would say though is to ensure you are using the proper types for everything. For example, string.length() returns a std::string::size_type (most likely a size_t; the constructor also takes a std::string::size_type, but that one isn't as big of a deal). It probably won't ever bite you, but it is something to be careful of to ensure you don't have any overflows in other code you may be writing.
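To make the narrowing explicit instead of relying on a bare (int) cast, the length could be range-checked before it is handed to an API that takes an int. A small sketch; the helper name and the choice of exception are assumptions for illustration:

```cpp
#include <climits>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns s.length() as an int, or throws if the
// length does not fit, rather than silently truncating it.
template <typename String>
int checked_int_length(const String& s)
{
    typedef typename String::size_type size_type;
    if (s.length() > static_cast<size_type>(INT_MAX))
        throw std::length_error("string too long for an int length parameter");
    return static_cast<int>(s.length());
}
```

If the terminator is to be counted as well, as in the question's + 1, the check has to leave room for that extra character.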
- Well, it returns a std::string::size_type. (Jon Purdy, Jan 30, 2011 at 9:15)
- @Jon: True, but I've never seen it not be equal to the representation of a size_t. I'll modify the answer though, thanks for your feedback. (Mark Loeser, Jan 30, 2011 at 15:41)
- @Jon: std::string::size_type is always a std::size_t. (GManNickG, Jan 30, 2011 at 22:02)
- @GMan: I was just being pedantic out of boredom. SGI says it's "an unsigned integral type that can represent any nonnegative value of the container's distance type" (that is, difference_type) and that both of these must be typedefs for existing types, but this doesn't imply that size_type has to be equivalent to size_t. Is there something else at work here? (Jon Purdy, Jan 30, 2011 at 23:28)
- @Jon: I'm not sure why SGI matters. The standard says std::string::size_type is allocator_type::size_type, and the default allocator's size_type is std::size_t. (GManNickG, Jan 31, 2011 at 5:31)
When we're using std::string and std::wstring, we need #include <string>.
There's no declaration of MultiByteToWideChar() or WideCharToMultiByte(). Their names suggest they might be some thin wrapper around std::mbstowcs() and std::wcstombs() respectively, but without seeing them, it's hard to be sure.
It would be much simpler and easier to understand if we used the standard functions to convert between null-terminated multibyte and wide-character strings.
Depending on the use case, a std::codecvt<wchar_t, char, std::mbstate_t> facet might be more appropriate. Then there's very little code to be written, and in particular, no need to guess at the possible output length.
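As a rough illustration of the facet-based approach, here is a sketch that widens a std::string through the std::codecvt<wchar_t, char, std::mbstate_t> facet of a given locale. The function name widen, the default locale argument, and the choice to throw on failure are assumptions made for this example; the input encoding is whatever the supplied locale says it is.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts s to a wide string using the codecvt facet
// of loc (by default, a copy of the current global locale).
std::wstring widen(const std::string& s, const std::locale& loc = std::locale())
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    if (s.empty())
        return std::wstring();

    // Each wide character consumes at least one byte, so s.size() is a safe
    // upper bound for the output; the string is trimmed afterwards.
    std::wstring out(s.size(), L'\0');
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = 0;
    wchar_t* to_next = 0;

    const std::codecvt_base::result res = cvt.in(
        state,
        s.data(), s.data() + s.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    if (res != std::codecvt_base::ok)
        throw std::runtime_error("widen: conversion failed");

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

The reverse direction would follow the same pattern with the facet's out() member and a buffer of s.size() * cvt.max_length() bytes.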
I've only briefly looked over your code. I haven't worked with std::string much but I've worked a lot with the API.
Assuming you got all your lengths and arguments right (sometimes making sure the terminator and wide vs multibyte lengths are all right can be tricky), I think you're on the right track. I think the first routines you posted unnecessarily allocate an additional buffer. It isn't needed.
No, this is dangerous! The characters in a std::string may not be stored in a contiguous memory block and you must not use the pointer &r[0] to write to any characters other than that character! This is why the c_str() function returns a const pointer.
It might work with MSVC, but it will probably break if you switch to a different compiler or STL library.
- -1: Wrong: stackoverflow.com/questions/2256160/… (Billy ONeal, Nov 5, 2011 at 13:59)