Problem/Background:
I want to support only a few special Unicode characters in my program (Greek letters and some math operators). C++ provides the char type, which is always 1 byte in size; that is too small to represent all characters. Normally this is solved by choosing an encoding such as UTF-8 (or UTF-16/UTF-32). However, the C++ regex algorithms do not support UTF-8.
So the choice falls on the other data type, wchar_t, which is a platform-dependent wide character. On Windows it spans 2 bytes and holds UTF-16 code units, while on Linux it is 4 bytes (UTF-32). While researching this, basically everybody said that you always want to use wchar_t (or the respective container std::wstring) on Windows, but never on Linux. Since I don't need many special characters, I settled on using the char type internally and converting incoming std::wstring values to my own extended ASCII code page. For convenience, I kept the first 128 characters (codes 0 to 127) identical to ASCII.
What does my code do?
The function unicode_to_xascii takes a std::wstring and quietly removes all characters that are not in my defined code page. Characters with an id > 255 that belong to the code page are converted to their respective XASCII value.
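The byte layout that results from the constants in the posted code can be worked out by hand. The sketch below restates that arithmetic with unsigned values so the ranges are easy to read; the namespace and constant names are hypothetical and only mirror the ones in the question.

```cpp
#include <cstdint>

// Hypothetical restatement of the code page layout, for illustration only.
namespace layout
{
    constexpr std::uint16_t greek_upper_first = 913; // U+0391
    constexpr std::uint16_t greek_upper_last  = 937; // U+03A9
    constexpr std::uint16_t greek_lower_first = 945; // U+03B1
    constexpr std::uint16_t greek_lower_last  = 969; // U+03C9

    constexpr unsigned begin       = 0xC0;
    constexpr unsigned upper_last  = begin + (greek_upper_last - greek_upper_first);
    constexpr unsigned lower_first = upper_last + 1;
    constexpr unsigned lower_last  = lower_first + (greek_lower_last - greek_lower_first);
    constexpr unsigned ops_first   = lower_last + 1;
    constexpr unsigned ops_last    = ops_first + 3; // four math operators
    constexpr unsigned ignore      = 0xFF;          // the byte value of char(-1)

    // Compile-time checks of the worked-out ranges.
    static_assert(upper_last == 0xD8, "uppercase Greek occupies 0xC0..0xD8");
    static_assert(lower_first == 0xD9 && lower_last == 0xF1, "lowercase Greek occupies 0xD9..0xF1");
    static_assert(ops_first == 0xF2 && ops_last == 0xF5, "operators occupy 0xF2..0xF5");
    static_assert(ops_last < ignore, "the IGNORE marker does not collide with a mapped byte");
}
```

So the code page uses 0x00..0x7F for ASCII, 0xC0..0xF5 for the extra characters, and leaves 0x80..0xBF unused.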
Concerns/Questions:
Please comment on or answer these, in addition to anything else you might notice in the code.
- Obviously, there's extra overhead for the conversion, but since all other operations in between can be performed on 1 byte per character instead of 2 or even 4 bytes, I think this is a valid trade-off.
- I tried to avoid magic numbers and obscure bit arithmetic. Please let me know if there's something that could be made clearer.
- Now that I think about it, I could have probably worked something out via operator overloading. Is there a better approach than this functional one?
```cpp
#pragma once
#ifdef __WIN32__
#include <io.h>
#include <fcntl.h>
#endif
#include <algorithm>
#include <array>
#include <iostream>
#include <regex>
#include <string>

namespace Utility
{
    void setup_unicode()
    {
        std::setlocale(LC_ALL, "en_US.UTF-8");
#ifdef __WIN32__
        _setmode(_fileno(stdout), _O_U16TEXT);
        _setmode(_fileno(stdin), _O_U16TEXT);
#endif
    }

    namespace Unicode
    {
        constexpr uint16_t ASCII_END = 127;
        constexpr uint16_t ALPHA = 913;
        constexpr uint16_t OMEGA = 937;
        constexpr uint16_t alpha = 945;
        constexpr uint16_t omega = 969;
        constexpr uint16_t circled_plus = 8853;
        constexpr uint16_t circled_minus = 8854;
        constexpr uint16_t circled_times = 8855;
        constexpr uint16_t cross_product = 10799;
        constexpr std::array<uint16_t, 4> math_operators = {circled_plus, circled_minus, circled_times, cross_product};
    }

    namespace XASCII
    {
        constexpr char BEGIN = '\xc0';
        constexpr char ALPHA = BEGIN;
        constexpr char OMEGA = ALPHA + Unicode::OMEGA - Unicode::ALPHA;
        constexpr char alpha = OMEGA + 1;
        constexpr char omega = alpha + Unicode::omega - Unicode::alpha;
        constexpr char circled_plus = omega + 1;
        constexpr char circled_minus = circled_plus + 1;
        constexpr char circled_times = circled_minus + 1;
        constexpr char cross_product = circled_times + 1;
        constexpr std::array<char, 4> math_operators = {circled_plus, circled_minus, circled_times, cross_product};
        constexpr char IGNORE = -1;
        constexpr char REGEX_ALPHA_OMEGA[] = {XASCII::ALPHA, '-', XASCII::OMEGA};
        constexpr char REGEX_ALPHA_omega[] = {XASCII::ALPHA, '-', XASCII::omega};
    }

    bool is_utf16_carry_mark_set(uint16_t i)
    {
        return i & ((1u << 15u) + (1u << 14u));
    }

    uint16_t to_int(wchar_t w)
    {
#ifdef __unix__
        auto *p = reinterpret_cast<uint32_t *>(&w);
        if( *p >= 1u << 16u ){
            *p = static_cast<unsigned char>(XASCII::IGNORE);
        }
        return static_cast<uint16_t>(*p);
#endif
#ifdef __WIN32__
        auto *p = reinterpret_cast<uint16_t*>(&w);
        if( is_utf16_carry_mark_set(*p) )
        {
            *p = XASCII::IGNORE;
        }
        return static_cast<uint16_t>(*p);
#endif
    }

    std::string unicode_to_xascii(const std::wstring &wstr)
    {
        std::string result;
        unsigned p = 0, len = wstr.length();
        result.resize(len);
        for( unsigned k = 0; k < len; ++k ){
            uint16_t character = to_int(wstr[k]);
            if( character == static_cast<uint16_t>(XASCII::IGNORE)){
                continue;
            }
            if( character <= Unicode::ASCII_END ){
                result[p++] = static_cast<unsigned char>(character);
                continue;
            }
            if( Unicode::ALPHA <= character && character <= Unicode::OMEGA ){
                result[p++] = static_cast<unsigned char>(XASCII::ALPHA + (character - Unicode::ALPHA));
            }
            else if( Unicode::alpha <= character && character <= Unicode::omega ){
                result[p++] = static_cast<unsigned char>(XASCII::alpha + character - Unicode::alpha);
                continue;
            }
            auto index = std::find(Unicode::math_operators.cbegin(), Unicode::math_operators.cend(), character);
            if( index != Unicode::math_operators.cend()){
                result[p++] = XASCII::math_operators[index - Unicode::math_operators.cbegin()];
            }
        }
        result.resize(p);
        return result;
    }

    std::wstring xascii_to_unicode(const std::string &str)
    {
        std::wstring result;
        unsigned p = 0, len = str.length();
        result.resize(len);
        for( unsigned k = 0; k < len; ++k ){
            char character = str[k];
            if( character == XASCII::IGNORE ){
                continue;
            }
            if( character < XASCII::BEGIN ){
                result[p++] = static_cast<wchar_t>(character);
            }
            else if( XASCII::ALPHA <= character && character <= XASCII::OMEGA ){
                result[p++] = static_cast<wchar_t>(Unicode::ALPHA + (character - XASCII::ALPHA));
            }
            else if( XASCII::alpha <= character && character <= XASCII::omega ){
                result[p++] = static_cast<wchar_t>(Unicode::alpha + (character - XASCII::alpha));
            }
            else{
                auto index = std::find(XASCII::math_operators.cbegin(), XASCII::math_operators.cend(), character);
                if( index != XASCII::math_operators.cend()){
                    result[p++] = static_cast<wchar_t>(Unicode::math_operators[index - XASCII::math_operators.cbegin()]);
                }
            }
        }
        result.resize(p);
        return result;
    }
}
```
Answer:
Correct me if I'm wrong, but aren't functions and overloaded operators basically the same thing, a glorified goto statement? I would only use operator overloading if it makes the code easier to read when you come back to it in 3 months.
I would change all the 1u << x expressions to their hex values.
Those carry marks will not change any time soon, so you could make them constants and not rely on that function.
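As a rough sketch of what named hex constants could look like: the posted check, `i & ((1u << 15u) + (1u << 14u))`, is the same as `i & 0xC000`, which is non-zero for every value from 0x4000 upward and not only for UTF-16 surrogates (0xD800 to 0xDFFF). For this code page that is harmless, since the largest supported code point is 10799 (0x2A2F), but a mask that matches only surrogates states the intent more precisely. The namespace and names below are hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper names; not part of the posted code.
namespace utf16
{
    // All surrogate code units share the top five bits 11011.
    constexpr std::uint16_t surrogate_mask = 0xF800;
    constexpr std::uint16_t surrogate_bits = 0xD800;

    // True only for 0xD800..0xDFFF.
    constexpr bool is_surrogate(std::uint16_t unit)
    {
        return (unit & surrogate_mask) == surrogate_bits;
    }

    // Equivalent of the posted check, written with a hex constant:
    // true whenever bit 15 or bit 14 is set, i.e. for any unit >= 0x4000.
    constexpr bool top_bits_set(std::uint16_t unit)
    {
        return (unit & 0xC000) != 0;
    }
}
```

For example, the code unit 0x4E2D is an ordinary BMP character: `top_bits_set` reports it, `is_surrogate` does not.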
I think you forgot a continue in the ALPHA-to-OMEGA branch of the for loop in unicode_to_xascii.
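One way to make a forgotten continue impossible is to move the per-character mapping into its own function that returns as soon as a range matches. The following is only a sketch under assumptions: the function name, the `not_mapped` sentinel, and the use of a 32-bit code point parameter are all hypothetical, and the numeric values are copied from the constants in the question.

```cpp
#include <cstdint>

// Hypothetical sentinel for "not in the code page" (same byte as char(-1)).
constexpr unsigned char not_mapped = 0xFF;

// Hypothetical helper: maps one code point to its code page byte.
inline unsigned char to_xascii_byte(std::uint32_t cp)
{
    if (cp <= 127) {                 // plain ASCII passes through
        return static_cast<unsigned char>(cp);
    }
    if (cp >= 913 && cp <= 937) {    // uppercase Greek, starts at 0xC0
        return static_cast<unsigned char>(0xC0 + (cp - 913));
    }
    if (cp >= 945 && cp <= 969) {    // lowercase Greek, starts at 0xD9
        return static_cast<unsigned char>(0xD9 + (cp - 945));
    }
    switch (cp) {                    // math operators, 0xF2..0xF5
        case 8853:  return 0xF2;     // circled plus
        case 8854:  return 0xF3;     // circled minus
        case 8855:  return 0xF4;     // circled times
        case 10799: return 0xF5;     // cross product
        default:    return not_mapped;
    }
}
```

The loop in the caller then only has to skip results equal to `not_mapped` and append everything else.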
Are you ever worried about a string longer than an unsigned can index? unsigned k = 0; k < len; ++k
Are you restricting the size of the string elsewhere?
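A range-based for loop with push_back avoids the index type question altogether, because no counter is narrowed from std::size_t to unsigned. The sketch below shows this for the reverse direction. It also converts each byte to unsigned char before comparing: on platforms where plain char is signed, '\xc0' is negative, so a comparison like `character < XASCII::BEGIN` is false for ASCII characters. The function name and the `operators` table are hypothetical; the numeric values come from the question.

```cpp
#include <cstdint>
#include <string>

// Hypothetical table of the four operator code points, in code page order.
constexpr std::uint16_t operators[] = {8853, 8854, 8855, 10799};

// Hypothetical counterpart to xascii_to_unicode, written without indices.
inline std::wstring xascii_to_wide(const std::string &str)
{
    std::wstring result;
    result.reserve(str.size());
    for (char c : str) {
        // Compare as an unsigned byte so the ranges behave the same
        // whether plain char is signed or unsigned.
        const unsigned b = static_cast<unsigned char>(c);
        if (b <= 0x7F) {
            result.push_back(static_cast<wchar_t>(b));
        } else if (b >= 0xC0 && b <= 0xD8) {
            result.push_back(static_cast<wchar_t>(913 + (b - 0xC0)));
        } else if (b >= 0xD9 && b <= 0xF1) {
            result.push_back(static_cast<wchar_t>(945 + (b - 0xD9)));
        } else if (b >= 0xF2 && b <= 0xF5) {
            result.push_back(static_cast<wchar_t>(operators[b - 0xF2]));
        }
        // Every other byte is outside the code page and is dropped.
    }
    return result;
}
```

Calling `xascii_to_wide` on the bytes 'A', 0xC0, 0xD9, 0xF5, 0xFF would yield four wide characters: 'A', 913, 945 and 10799, with the trailing 0xFF dropped.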
Other than that, it looks okay.