Windows console windows unfortunately do not support stream I/O of international characters. For instance, in Windows 7 you can still do "chcp 65001" (which sets the active code page to UTF-8), type "more", and get a crash.
This means that it's practically impossible for a novice to write a "Hello, world!" program that:
- has national characters in literals
- will yield the same results in *nix and Windows regardless of country
After a little discussion about this on a Boost mailing list I sat down to implement more fully an idea that I sketched there, namely to use the natural encoding of the OS for internal strings, which in practice means UTF-8 for internal strings on *nix and UTF-16 for internal strings on Windows. It's worth noting that e.g. the ICU library, at least according to its documentation, uses UTF-16 encoded strings.
With UTF-8 as a common external coding, one has:
Extern           Intern           E.g. ICU library
UTF-8     -      UTF-8     <->    UTF-16              <- for *nix
UTF-8     <->    UTF-16    -      UTF-16              <- for Windows
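To make the `<->` arrows above concrete, here is a minimal sketch of the UTF-8 to UTF-16 direction. It is not the `CodecUtf8` facet used in the code further down; the function name `utf16FromUtf8` is made up for illustration, and the sketch assumes well-formed input (malformed sequences are not detected).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the posted library.
// Assumption: the input is well-formed UTF-8; no validation is done.
inline std::u16string utf16FromUtf8( std::string const& s )
{
    std::u16string result;
    std::size_t i = 0;
    while( i < s.size() )
    {
        unsigned char const lead = static_cast<unsigned char>( s[i] );
        // The lead byte tells how many continuation bytes follow.
        int const nExtra = (lead < 0x80? 0 : lead < 0xE0? 1 : lead < 0xF0? 2 : 3);
        // Keep only the payload bits of the lead byte.
        unsigned long codePoint = (nExtra == 0? lead : lead & (0x3F >> nExtra));
        // Each continuation byte contributes six more bits.
        for( int k = 1; k <= nExtra && i + k < s.size(); ++k )
        {
            unsigned char const next = static_cast<unsigned char>( s[i + k] );
            codePoint = (codePoint << 6) | (next & 0x3F);
        }
        i += nExtra + 1;

        if( codePoint < 0x10000 )
        {
            // Basic Multilingual Plane: one UTF-16 code unit.
            result += static_cast<char16_t>( codePoint );
        }
        else
        {
            // Outside the BMP: encode as a surrogate pair.
            unsigned long const v = codePoint - 0x10000;
            result += static_cast<char16_t>( 0xD800 + (v >> 10) );
            result += static_cast<char16_t>( 0xDC00 + (v & 0x3FF) );
        }
    }
    return result;
}
```

For example, the two bytes C3 A5 (the letter "å") become the single code unit 0x00E5, while a four-byte sequence yields two code units.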
To support that I define a macro with the incredibly short name U, which adapts a literal to the platform one compiles for, and creates a strongly typed string or character. The typing helps to avoid using functions in a non-portable manner. And it enables argument dependent lookup, like this:
#include <progrock/cppx/u/adapted_iostream.h>
using namespace std;
namespace u = progrock::cppx::u;
int main()
{
u::out << U( "Hello, world!" ) << endl;
u::out << U( "2+2 = " ) << 2 + 2 << endl;
u::out << U( "Blåbærsyltetøy! 日本国 кошка!" ) << endl;
}
Here u::out is either std::cout (for *nix) or std::wcout (for Windows). And the source code needs to be stored as UTF-8 with BOM in order to compile nicely with both g++ and Visual C++.
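As a stripped-down illustration of the strong typing and the argument dependent lookup, consider the sketch below. The namespace `demo`, the type `Char`, the macro `DEMO_U` and the function `render` are invented for this example and are much simpler than the real headers shown later; the point is only that a distinct enum-based character type lets a custom `operator<<` be selected.

```cpp
#include <sstream>
#include <string>

namespace demo {
    // A distinct character type: same size as char, but a different type,
    // so overloads can tell typed strings apart from plain char strings.
    enum Char : char {};

    inline Char const* typed( char const* p )
    {
        return reinterpret_cast<Char const*>( p );
    }

    // Found via argument dependent lookup, because Char lives in namespace demo.
    // It is an exact match, so it wins over the standard void const* overload.
    inline std::ostream& operator<<( std::ostream& stream, Char const* s )
    {
        // The prefix is only there to show that this overload was chosen.
        return stream << "[typed] " << reinterpret_cast<char const*>( s );
    }
}

// Hypothetical stand-in for the U macro (UTF-8 flavour only).
#define DEMO_U( aLiteral ) ::demo::typed( aLiteral )

inline std::string render()
{
    std::ostringstream stream;
    stream << DEMO_U( "Hello, world!" );    // No using-directive needed.
    return stream.str();
}
```

Calling `render()` goes through `demo::operator<<` even though the call site never names namespace `demo`.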
If standard output goes to a Windows console window, then the Norwegian, Russian and Chinese characters result in correct Unicode code points in the console window's text buffer. The Norwegian and Russian characters display OK, the Chinese ones display as rectangles (on my machine), and the text can be copied correctly to e.g. Notepad, which can display it all. If standard output is redirected or piped, then the result is UTF-8.
This is mostly just hobby programming, and only with the compilers I happen to have on my machine, namely Visual C++ and g++. I'm hoping that some people not only understand the question but are able to give useful feedback.
progrock/cppx/u/adapted_iostream.h
#pragma once
// Copyright (c) 2011, Alf P. Steinbach
//--------------------------------------------------------- Dependencies:
#include <progrock/cppx/u/translating_streams.compiler_dependent.h> // translating?Stream
#include <progrock/cppx/u/cpp_string_output.os_dependent.h> // writeTo
#include <progrock/cppx/u.h> // Encoding::Enum etc.
#include <iostream>
#include <locale>
//#include <codecvt> // C++11 std::codecvt_utf8_utf16, not supported by g++ 4.4.1.
#include <progrock/cppx/u/CodecUtf8.h> // Sort of equivalent DIY functionality instead.
//--------------------------------------------------------- Implementation:
namespace progrock{ namespace cppx{ namespace u {
namespace naturalEncoding {
namespace detail {
template< Encoding::Enum encoding >
struct StdStreamsBase
{
typedef EncodingTraits< encoding > Traits;
typedef std::basic_istream< typename Traits::Raw::Value > IStream;
typedef std::basic_ostream< typename Traits::Raw::Value > OStream;
};
} // namespace detail
template< Encoding::Enum encoding >
struct StdStreams;
template<>
struct StdStreams< Encoding::utf8 >
: detail::StdStreamsBase< Encoding::utf8 >
{
static IStream& inStream() { return std::cin; }
static OStream& outStream() { return std::cout; }
static OStream& errStream() { return std::cerr; }
static OStream& logStream() { return std::clog; }
};
template<>
struct StdStreams< Encoding::utf16 >
: detail::StdStreamsBase< Encoding::utf16 >
{
private:
template< class Stream >
static Stream& withUtf8Conversion( Stream& stream )
{
std::locale const utf8Locale( stream.getloc(), new CodecUtf8() );
stream.imbue( utf8Locale );
return stream;
}
public:
static IStream& inStream()
{
static IStream& stream = withUtf8Conversion( translatingInWStream() );
return stream;
}
static OStream& outStream()
{
static OStream& stream = withUtf8Conversion( translatingOutWStream() );
return stream;
}
static OStream& errStream()
{
static OStream& stream = withUtf8Conversion( translatingErrWStream() );
return stream;
}
static OStream& logStream()
{
static OStream& stream = withUtf8Conversion( translatingLogWStream() );
return stream;
}
};
} // namespace naturalEncoding
typedef naturalEncoding::StdStreams< u::encoding >::IStream IStream;
typedef naturalEncoding::StdStreams< u::encoding >::OStream OStream;
static IStream& in = naturalEncoding::StdStreams< encoding >::inStream();
static OStream& out = naturalEncoding::StdStreams< encoding >::outStream();
static OStream& err = naturalEncoding::StdStreams< encoding >::errStream();
static OStream& log = naturalEncoding::StdStreams< encoding >::logStream();
inline bool isCommon( OStream const& stream )
{
OStream const* const p = &stream;
return (p == &out || p == &err || p == &log);
}
inline OStream& operator<<( OStream& stream, CodingValue const v )
{
if( isCommon( stream ) ) // <-- Is just an optimization.
{
return writeTo( stream, v ); // Special-cases Windows console output.
}
else
{
return stream << raw( v );
}
}
inline OStream& operator<<( OStream& stream, CodingValue const s[] )
{
if( isCommon( stream ) ) // <-- Is just an optimization.
{
return writeTo( stream, s ); // Special-cases Windows console output.
}
else
{
return stream << raw( s );
}
}
} } } // namespace progrock::cppx::u
As I copied it for posting here, I added a couple of inline specifiers that I'd forgotten. This is very much a snapshot of a work in progress. It works for output (I haven't tested C++ iostream level input yet), but it is possibly very far from perfect code!
progrock\cppx\u.h
#pragma once
// Copyright (c) 2011, Alf P. Steinbach
//--------------------------------------------------------- Dependencies:
#include <progrock/cppx/u_encoding_choice.os_dependent.h> // CPPX_NATURAL_ENCODING
#include <progrock/cppx/c++11_emulation.h> // CPPX_STATIC_ASSERT, CPPX_NOEXCEPT
#include <progrock/cppx/stdlib_typedefs.h> // cppx::Size
#include <locale> // std::char_traits
#include <utility> // comparison operators
//--------------------------------------------------------- Interface:
#if CPPX_NATURAL_ENCODING == CPPX_ENCODING_UTF16
#
CPPX_STATIC_ASSERT( sizeof( wchar_t ) == 2 );
# define CPPX_U_ENCODING ::progrock::cppx::u::Encoding::utf16
# define U( aLiteral ) ::progrock::cppx::u::typed( L##aLiteral )
#
#elif CPPX_NATURAL_ENCODING == CPPX_ENCODING_UTF8
#
# define CPPX_U_ENCODING ::progrock::cppx::u::Encoding::utf8
# define U( aLiteral ) ::progrock::cppx::u::typed( aLiteral )
#
#else
# error "The natural encoding for this OS is not supported, sorry."
#endif
namespace progrock { namespace cppx { namespace u {
using namespace std::rel_ops; // operator!= etc.
struct Encoding { enum Enum{ ansi, utf8, utf16 }; };
template< Encoding::Enum a > struct EncodingUnit;
template<> struct EncodingUnit< Encoding::ansi > { typedef char Type; };
template<> struct EncodingUnit< Encoding::utf8 > { typedef char Type; };
template<> struct EncodingUnit< Encoding::utf16 > { typedef wchar_t Type; };
template< Encoding::Enum e >
struct EncodingTraits; // Must be specialized due to Visual C++ bug.
template<>
struct EncodingTraits< Encoding::utf8 >
{
typedef char UnitType;
typedef std::char_traits< UnitType > UnitTraits;
struct Raw
{
typedef UnitType Value;
typedef UnitTraits::int_type ExtendedValue;
};
enum Value : Raw::Value {};
enum ExtendedValue : Raw::ExtendedValue {};
CPPX_STATIC_ASSERT( sizeof( Value ) == sizeof( Raw::Value ) );
CPPX_STATIC_ASSERT( sizeof( ExtendedValue ) == sizeof( Raw::ExtendedValue ) );
};
template<>
struct EncodingTraits< Encoding::utf16 >
{
typedef wchar_t UnitType;
typedef std::char_traits< UnitType > UnitTraits;
struct Raw
{
typedef UnitType Value;
typedef UnitTraits::int_type ExtendedValue;
};
enum Value : Raw::Value {};
enum ExtendedValue : Raw::ExtendedValue {};
CPPX_STATIC_ASSERT( sizeof( Value ) == sizeof( Raw::Value ) );
CPPX_STATIC_ASSERT( sizeof( ExtendedValue ) == sizeof( Raw::ExtendedValue ) );
};
Encoding::Enum const encoding = CPPX_U_ENCODING;
typedef EncodingTraits< encoding > Traits;
typedef Traits::Value CodingValue;
typedef Traits::ExtendedValue ExtendedCodingValue;
typedef Traits::Raw::Value RawCodingValue;
typedef Traits::Raw::ExtendedValue RawExtendedCodingValue;
inline RawCodingValue raw( CodingValue const v ) CPPX_NOEXCEPT
{
return v;
}
inline RawCodingValue* raw( CodingValue* p ) CPPX_NOEXCEPT
{
return reinterpret_cast< RawCodingValue* >( p );
}
inline RawCodingValue const* raw( CodingValue const* p ) CPPX_NOEXCEPT
{
return reinterpret_cast< RawCodingValue const* >( p );
}
template< Size size >
inline RawCodingValue (&raw( CodingValue (&s)[size] )) [size]
{
return reinterpret_cast< RawCodingValue (&)[size] >( s );
}
template< Size size >
inline RawCodingValue const (&raw( CodingValue const (&s)[size] ) CPPX_NOEXCEPT)[size]
{
return reinterpret_cast< RawCodingValue const (&)[size] >( s );
}
enum Koenig {};
inline CodingValue typed( RawCodingValue const v ) CPPX_NOEXCEPT
{
return CodingValue( v );
}
inline CodingValue typed( Koenig, RawCodingValue const v ) CPPX_NOEXCEPT
{
return CodingValue( v );
}
inline CodingValue* typed( RawCodingValue* const p ) CPPX_NOEXCEPT
{
return reinterpret_cast< CodingValue* >( p );
}
inline CodingValue* typed( Koenig, RawCodingValue* const p ) CPPX_NOEXCEPT
{
return reinterpret_cast< CodingValue* >( p );
}
inline CodingValue const* typed( RawCodingValue const* const p ) CPPX_NOEXCEPT
{
return reinterpret_cast< CodingValue const* >( p );
}
inline CodingValue const* typed( Koenig, RawCodingValue const* const p ) CPPX_NOEXCEPT
{
return reinterpret_cast< CodingValue const* >( p );
}
template< Size size >
inline CodingValue (&typed( RawCodingValue (&s)[size] ) CPPX_NOEXCEPT)[size]
{
return reinterpret_cast< CodingValue (&)[size] >( s );
}
template< Size size >
inline CodingValue (&typed( Koenig, RawCodingValue (&s)[size] ) CPPX_NOEXCEPT)[size]
{
return reinterpret_cast< CodingValue (&)[size] >( s );
}
template< Size size >
inline CodingValue const (&typed( RawCodingValue const (&s)[size] ) CPPX_NOEXCEPT)[size]
{
return reinterpret_cast< CodingValue const (&)[size] >( s );
}
template< Size size >
inline CodingValue const (&typed( Koenig, RawCodingValue const (&s)[size] ) CPPX_NOEXCEPT)[size]
{
return reinterpret_cast< CodingValue const (&)[size] >( s );
}
} } } // namespace progrock::cppx::u
namespace std {
// Requirements specified by C++11 §21.2.1/1 table 62.
template<>
class char_traits< ::progrock::cppx::u::CodingValue >
{
private:
typedef ::progrock::cppx::u::Koenig adl;
public:
typedef ::progrock::cppx::u::CodingValue char_type;
typedef ::progrock::cppx::u::ExtendedCodingValue int_type;
typedef ::progrock::cppx::u::Traits::UnitTraits Std;
typedef Std::off_type off_type;
typedef Std::pos_type pos_type;
typedef Std::state_type state_type;
static bool eq( char_type a, char_type b ) CPPX_NOEXCEPT
{ return (a == b); }
static bool lt( char_type a, char_type b ) CPPX_NOEXCEPT
{ return (a < b); }
static int compare( char_type const* s1, char_type const* s2, size_t n )
{ return Std::compare( raw( s1 ), raw( s2 ), n ); }
static size_t length( char_type const* s )
{ return Std::length( raw( s ) ); }
static char_type const* find( char_type const* s, size_t n, char_type const a )
{ return typed( adl(), Std::find( raw( s ), n, raw( a ) ) ); }
static char_type* move( char_type* s1, char_type const* s2, size_t n )
{ return typed( adl(), Std::move( raw( s1 ), raw( s2 ), n ) ); }
static char_type* copy( char_type* s1, char_type const* s2, size_t n )
{ return typed( adl(), Std::copy( raw( s1 ), raw( s2 ), n ) ); }
static void assign( char_type& c1, char_type const c2 ) CPPX_NOEXCEPT
{ c1 = c2; }
static char_type* assign( char_type* s, size_t n, char_type const a )
{ return typed( adl(), Std::assign( raw( s ), n, raw( a ) ) ); }
static int_type not_eof( int_type const c ) CPPX_NOEXCEPT
{ return int_type( Std::not_eof( c ) ); }
static char_type to_char_type( int_type const c ) CPPX_NOEXCEPT
{ return typed( c ); }
static int_type to_int_type( char_type const c ) CPPX_NOEXCEPT
{ return int_type( c ); }
static bool eq_int_type( int_type const c1, int_type const c2 ) CPPX_NOEXCEPT
{ return (c1 == c2); }
static int_type eof() CPPX_NOEXCEPT
{ return int_type( Std::eof() ); }
};
} // namespace std
1 Answer
Since the code is supposed to work for Unix the pragma is a bad idea.
#pragma once
Prefer to use normal include guards.
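Something along these lines would do. The guard macro name is just a suggestion, and the namespace `guard_demo` with its contents is a stand-in for the real header body; the second block only simulates the header being included twice in one translation unit.

```cpp
// Sketch of progrock/cppx/u/adapted_iostream.h with a classic include guard.
// The macro name is hypothetical; any name unique to the header works.
#ifndef PROGROCK_CPPX_U_ADAPTED_IOSTREAM_H
#define PROGROCK_CPPX_U_ADAPTED_IOSTREAM_H

namespace guard_demo {
    // Stand-in for the real header contents.
    struct StreamTag {};
    inline int headerRevision() { return 1; }
}

#endif // PROGROCK_CPPX_U_ADAPTED_IOSTREAM_H

// Simulates a second #include of the same header: the guard macro is
// already defined, so this body is skipped and StreamTag is not redefined.
#ifndef PROGROCK_CPPX_U_ADAPTED_IOSTREAM_H
#define PROGROCK_CPPX_U_ADAPTED_IOSTREAM_H
namespace guard_demo { struct StreamTag {}; }
#endif
```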
You are imbuing streams here:
std::locale const utf8Locale( stream.getloc(), new CodecUtf8() );
stream.imbue( utf8Locale );
return stream;
The only problem I see with this is that after you have started using the stream, any attempt to imbue can silently fail (or at least it used to; they may have fixed that in C++11).
Now I assume you are trying to force this initialization before use with:
static IStream& in = naturalEncoding::StdStreams< encoding >::inStream();
static OStream& out = naturalEncoding::StdStreams< encoding >::outStream();
static OStream& err = naturalEncoding::StdStreams< encoding >::errStream();
static OStream& log = naturalEncoding::StdStreams< encoding >::logStream();
This will work 99% of the time, but if somebody starts logging (using one of the std:: streams in/out/err/log) in the constructor of a global-scope object with static storage duration, then all bets are off. Since this is a rare case I am not too worried, but you should document it somewhere, such as at the top of the header file (assuming it is still a problem).
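To illustrate the case I mean, here is a small sketch with made-up names (`init_demo`, `configured`, `EarlyLogger`), using a string stream as a stand-in for std::cout and the classic locale as a stand-in for your UTF-8 locale. When all access goes through an accessor function, the setup runs on the first call, even if that call comes from a global constructor.

```cpp
#include <locale>
#include <ostream>
#include <sstream>

namespace init_demo {
    // Counts how often the one-time stream setup has run.
    inline int& setupCount()
    {
        static int count = 0;
        return count;
    }

    // Stand-in for the imbue step; the classic locale is for illustration only.
    inline std::ostream& configured( std::ostream& stream )
    {
        stream.imbue( std::locale::classic() );
        ++setupCount();
        return stream;
    }

    // Function-local statics are initialized on the first call, so the setup
    // happens before any output that goes through this accessor.
    inline std::ostream& out()
    {
        static std::ostringstream buffer;       // Stand-in for std::cout.
        static std::ostream& stream = configured( buffer );
        return stream;
    }

    // A global object that logs in its constructor, i.e. the rare case above.
    struct EarlyLogger
    {
        EarlyLogger() { out() << "logging during static initialization\n"; }
    };
    static EarlyLogger const earlyLogger;
}
```

This does not help if the early code writes to std::cout directly rather than through the accessor, which is why the limitation is still worth documenting.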
I don't see a definition for U() or writeTo() or raw() or CodingValue.
- Thanks for those comments. Does the #pragma once not work in *nix? Regarding the definitions you ask for, U() and raw() are in one header, while writeTo is a platform-dependent operation and so is in a couple of headers. I think it is perhaps enough to post the former header? – Alf P. Steinbach, Nov 8, 2011 at 15:50
- All pragmas are compiler specific. Not all compilers support once. – Loki Astari, Nov 8, 2011 at 15:54
- well, that's a well-considered engineering decision. support everything under the sun, or something more reasonable? i landed on reasonable. the issue has been discussed extensively even here on SO. and i couldn't support those non-mainstream compilers anyway -- there's enough compiler specific stuff with just g++ and msvc involved. – Alf P. Steinbach, Nov 8, 2011 at 16:03
- @AlfP.Steinbach: Three lines rather than one does not seem an unreasonable cost for greater support. I assume your platform specific files will generate the appropriate errors for unsupported platforms. – Loki Astari, Nov 8, 2011 at 18:05
- @AlfP.Steinbach: Maybe you should change your support from *nix to Linux (gcc is practically default for Linux but not *nix). – Loki Astari, Nov 8, 2011 at 18:05
#include <progrock/cppx/u/adapted_iostream.h>
then there would be something to review. But the code above is meaningless without it.