Converting between std::wstring and std::string

Question 1

While researching ways to convert back and forth between std::wstring and std::string, I found this conversation on the MSDN forums.

There were two functions that, to me, looked good. Specifically, these:

std::wstring s2ws(const std::string& s)
{
 int len;
 int slength = (int)s.length() + 1;
 len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, 0, 0); 
 wchar_t* buf = new wchar_t[len];
 MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, buf, len);
 std::wstring r(buf);
 delete[] buf;
 return r;
}
std::string ws2s(const std::wstring& s)
{
 int len;
 int slength = (int)s.length() + 1;
 len = WideCharToMultiByte(CP_ACP, 0, s.c_str(), slength, 0, 0, 0, 0); 
 char* buf = new char[len];
 WideCharToMultiByte(CP_ACP, 0, s.c_str(), slength, buf, len, 0, 0); 
 std::string r(buf);
 delete[] buf;
 return r;
}

However, the double allocation and the need to delete the buffer concern me (performance and exception safety) so I modified them to be like this:

std::wstring s2ws(const std::string& s)
{
 int len;
 int slength = (int)s.length() + 1;
 len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, 0, 0); 
 std::wstring r(len, L'\0');
 MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, &r[0], len);
 return r;
}
std::string ws2s(const std::wstring& s)
{
 int len;
 int slength = (int)s.length() + 1;
 len = WideCharToMultiByte(CP_ACP, 0, s.c_str(), slength, 0, 0, 0, 0); 
 std::string r(len, '\0');
 WideCharToMultiByte(CP_ACP, 0, s.c_str(), slength, &r[0], len, 0, 0); 
 return r;
}

Unit testing indicates that this works in a nice, controlled environment, but will it hold up in the vicious and unpredictable world that is my client's computer?

anonanon · Answer 1 · 2011-01-31 18:17:21Z

I would redesign your set of functions to resemble casts (and I have done so myself):

std::wstring x;
std::string y = string_cast<std::string>(x);

This can have a lot of benefits later when you start having to deal with some 3rd party library's idea of what strings should look like.

I love the syntax. Can you share the code?
– Jere.Jones · 2011-01-31 18:53:42Z

Oooh. That looks nice. How would one do that? Just make a template with specializations to convert between the various string types?
– Billy ONeal · 2011-02-05 18:21:41Z

@Billy someone posted a codereview question for string_cast implementation on here if you guys are interested.
– greatwolf · 2011-11-05 03:20:49Z

Why, though? I think I prefer to have functions to_string and to_wstring, similar to the standard library (in my own namespace of course).
– Marc.2377 · 2019-09-13 23:55:53Z

AndiDog · Answer 2 · 2013-03-03 17:35:18Z

Actually my unit testing shows that your code is wrong!

The problem is that you include the zero terminator in the output string, which is not supposed to happen with std::string and friends. Here's an example why this can lead to problems, especially if you use std::string::compare:

// Allocate string with 5 characters (including the zero terminator as in your code!)
string s(5, '_');
memcpy(&s[0], "ABCD\0", 5);
// Comparing with strcmp is all fine since it only compares until the terminator
const int cmp1 = strcmp(s.c_str(), "ABCD"); // 0
// ...however the number of characters that std::string::compare compares is
// someString.size(), and since s.size() == 5, it is obviously not equal to "ABCD"!
const int cmp2 = s.compare("ABCD"); // positive (typically 1)
// And just to prove that string implementations automatically add a zero terminator
// if you call .c_str()
s.resize(3);
const int cmp3 = strcmp(s.c_str(), "ABC"); // 0
const char term = s.c_str()[3]; // 0
printf("cmp1=%d, cmp2=%d, cmp3=%d, terminator=%d\n", cmp1, cmp2, cmp3, (int)term);

I found the addition of the terminator annoying too: it also broke a string addition in my case. I ended up adding the boolean parameter includeTerminator to both methods.

@reallynice see: FlagArgument (martinfowler.com)

Jamerson · Answer 3 · 2016-11-11 07:14:52Z

It really depends on which encodings are being used with std::wstring and std::string.

This answer assumes that the std::wstring is using a UTF-16 encoding, and that the conversion to std::string will use a UTF-8 encoding.

#include <codecvt>
#include <string>
std::wstring utf8ToUtf16(const std::string& utf8Str)
{
 std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
 return conv.from_bytes(utf8Str);
}
std::string utf16ToUtf8(const std::wstring& utf16Str)
{
 std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
 return conv.to_bytes(utf16Str);
}

This answer uses the STL and does not rely on a platform-specific library.

Edited Nov 13, 2016 at 21:45.

This is the best answer because it is the only one that works. Thank you.
– Roman · 2017-12-24 09:02:49Z

dbj.org/c17-codecvt-deprecated-panic ... my shameless plug might help ...
– DBJDBJ · 2019-07-15 15:48:07Z

@DBJDBJ Your proposed solution is by no means a replacement for wstring_convert from <codecvt>. You kind of downplay the problem by saying the approach is "somewhat controversial" - in my opinion, it's much more than that. It's wrong. I have uses of wstring_convert which your solution cannot replace. It cannot convert true Unicode strings like "おはよう", as it's not a true conversion; it's a cast. Consider making that more explicit in your text ;)
– Marc.2377 · 2019-09-12 00:59:34Z

@Marc.2377 -- well, I think I have placed plenty of warnings in that text. I have even mentioned the word "casting", and there is even a link to "that article" ... Many thanks for reading in any case.
– DBJDBJ · 2019-09-12 15:17:45Z

FFWD to 2019 -- <codecvt> is deprecated
– DBJDBJ · 2019-09-12 16:15:12Z

Ferruccio · Answer 4 · 2011-01-30 16:05:31Z

One thing that may be an issue is that it assumes the string is ANSI formatted using the currently active code page (CP_ACP). You might want to consider using a specific code page or CP_UTF8 if it's UTF-8.

This may be a silly question but, how can I tell? For my usage these will typically be filenames.
– Jere.Jones · 2011-01-31 19:49:27Z

How do you obtain the filenames? That will determine the correct code page to use.
– Ferruccio · 2011-02-01 01:30:57Z

@Jere.Jones: One way is to check if the string is valid UTF-8. If not, assume it's ANSI.
– dan04 · 2011-02-05 16:31:30Z

@dan04: ANSI requires that a code page is specified. en.wikipedia.org/wiki/Code_page.
– Ferruccio · 2011-02-05 21:33:04Z

Further note: The MSDN documentation recommends not using CP_ACP for strings intended for permanent storage, because the active page may be changed at any time.
– M.M · 2018-02-20 20:46:58Z
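The "check if the string is valid UTF-8" heuristic mentioned in the comments could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name is_valid_utf8 is made up here, and the check is a heuristic, since pure ASCII text is valid in both encodings and text in some ANSI code page can, by coincidence, also form valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
// Rejects overlong encodings, UTF-16 surrogates and values above U+10FFFF.
bool is_valid_utf8(const std::string& s)
{
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const unsigned long min_cp[] = {0x0, 0x80, 0x800, 0x10000};

    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;   // continuation bytes expected after the lead
        unsigned long cp = 0;    // code point being assembled

        if (lead < 0x80)                { extra = 0; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else return false;       // stray continuation byte or invalid lead

        if (extra > n - i - 1) return false;   // sequence is cut off

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min_cp[extra]) return false;            // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // outside Unicode

        i += extra + 1;
    }
    return true;
}
```

A caller might try the UTF-8 path when this returns true and fall back to a specific, explicitly chosen code page otherwise.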

Roddy · Answer 5 · 2011-02-02 16:24:04Z

I'd recommend changing this:

int len;
int slength = (int)s.length() + 1;
len = WideCharToMultiByte(CP_ACP, 0, s.c_str(), slength, 0, 0, 0, 0);

...to this:

int slength = (int)s.length() + 1;
int len = WideCharToMultiByte(CP_ACP, 0, s.c_str(), slength, 0, 0, 0, 0);

Slightly more concise, len's scope is reduced, and you don't have an uninitialised variable floating round (ok, just for one line) as a trap for the unwary.

Mark Loeser · Answer 6 · 2011-01-29 17:49:15Z

I don't do any Windows development, so I can't comment on the WideCharToMultiByte part being safe.

The one thing I would say though is to ensure you are using the proper types for everything. For example, string.length() returns a std::string::size_type (most likely a size_t, the constructor also takes a std::string::size_type, but that one isn't as big of a deal). It probably won't ever bite you, but it is something to be careful of to ensure you don't have any overflows in other code you may be writing.
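One way to make the size_type-to-int narrowing explicit, rather than relying on a bare (int) cast, is a small checked conversion. The sketch below is an assumption-laden illustration: the helper name checked_length and the choice of std::length_error are not from the original code.

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>

// Hypothetical helper: convert a string length to int, throwing if the
// value does not fit. The extra parentheses around max keep the call safe
// even if a max() macro is defined, as <windows.h> does by default.
inline int checked_length(std::size_t n)
{
    if (n > static_cast<std::size_t>((std::numeric_limits<int>::max)()))
        throw std::length_error("string too long for an int length parameter");
    return static_cast<int>(n);
}
```

In the functions from the question, a call such as checked_length(s.length()) could replace the (int)s.length() cast.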

Edited Jan 30, 2011 at 15:42.

Well, it returns a std::string::size_type.
– Jon Purdy · 2011-01-30 09:15:19Z

@Jon: True, but I've never seen it not be equal to the representation of a size_t. I'll modify the answer though, thanks for your feedback.
– Mark Loeser · 2011-01-30 15:41:28Z

@Jon: std::string::size_type is always a std::size_t.
– GManNickG · 2011-01-30 22:02:46Z

@GMan: I was just being pedantic out of boredom. SGI says it's "an unsigned integral type that can represent any nonnegative value of the container's distance type" (that is, difference_type) and that both of these must be typedefs for existing types, but this doesn't imply that size_type has to be equivalent to size_t. Is there something else at work here?
– Jon Purdy · 2011-01-30 23:28:42Z

@Jon: I'm not sure why SGI matters. The standard says std::string::size_type is allocator_type::size_type, and the default allocator's size_type is std::size_t.
– GManNickG · 2011-01-31 05:31:34Z

Toby Speight · Answer 7 · 2019-04-10 17:21:16Z

When we're using std::string and std::wstring, we need #include <string>.

There's no declaration of MultiByteToWideChar() or WideCharToMultiByte(). Their names suggest they might be some thin wrapper around std::mbstowcs() and std::wcstombs() respectively, but without seeing them, it's hard to be sure.

It would be much simpler and easier to understand if we used the standard functions to convert between null-terminated multibyte and wide-character strings.

Depending on the use case, a std::codecvt<wchar_t, char, std::mbstate_t> facet might be more appropriate. Then there's very little code to be written, and in particular, no need to guess at the possible output length.
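A minimal sketch of the standard-function approach mentioned above, using std::mbstowcs() and std::wcstombs(), might look like this. The names to_wide and to_narrow are hypothetical. The sketch assumes that calling these functions with a null destination returns the required length, which is a POSIX-specified extension that common implementations provide. The conversion follows the current C locale, so the program is expected to call std::setlocale() appropriately first.

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Sketch only: converts using the current C locale.
// Conversion stops at the first embedded null character in the input.
std::wstring to_wide(const std::string& s)
{
    // With a null destination, the return value is the number of wide
    // characters required, not counting the terminator (assumed extension).
    const std::size_t len = std::mbstowcs(nullptr, s.c_str(), 0);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence");

    std::wstring result(len, L'\0');
    if (len != 0)
        std::mbstowcs(&result[0], s.c_str(), len);  // fills all len elements
    return result;
}

std::string to_narrow(const std::wstring& s)
{
    // Same two-pass pattern: measure first, then convert into the string.
    const std::size_t len = std::wcstombs(nullptr, s.c_str(), 0);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("character not representable in this locale");

    std::string result(len, '\0');
    if (len != 0)
        std::wcstombs(&result[0], s.c_str(), len);
    return result;
}
```

Because the measured length excludes the terminator, the returned strings do not carry an embedded null at the end.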

Jonathan Wood · Answer 8 · 2011-01-29 22:56:40Z

I've only briefly looked over your code. I haven't worked with std::string much but I've worked a lot with the API.

Assuming you got all your lengths and arguments right (sometimes making sure the terminator and wide vs multibyte lengths are all right can be tricky), I think you're on the right track. I think the first routines you posted unnecessarily allocate an additional buffer. It isn't needed.

user605592 · Answer 9 · 2011-11-05 01:46:54Z

No, this is dangerous! The characters in a std::string may not be stored in a contiguous memory block and you must not use the pointer &r[0] to write to any characters other than that character! This is why the c_str() function returns a const pointer.

It might work with MSVC, but it will probably break if you switch to a different compiler or STL library.

-1: Wrong: stackoverflow.com/questions/2256160/…
– Billy ONeal · 2011-11-05 13:59:20Z
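For context on the objection above: from C++11 onwards the standard requires the characters of a std::basic_string to be stored contiguously, so writing up to size() characters through &r[0] is well-defined under that standard. A minimal sketch of the pattern, with the hypothetical name copy_into_string:

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: size the string first, then let a C-style routine
// (here std::memcpy) write directly into its storage through &r[0].
std::string copy_into_string(const char* src, std::size_t n)
{
    std::string r(n, '\0');
    if (n != 0)
        std::memcpy(&r[0], src, n);  // relies on contiguous storage (C++11)
    return r;
}
```

The same pattern is what the modified functions in the question rely on when they pass &r[0] to the conversion API.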
