Custom cross-platform extended ASCII converter

Question 1

Problem/Background:

I want my program to support only a few special Unicode characters (Greek letters and some math operators). C++ provides the char type, which is always 1 byte in size; that is too small to represent all of these characters. Normally this is solved by choosing an encoding such as UTF-8 (or UTF-16/UTF-32). However, the C++ regex algorithms do not support UTF-8.

So the choice falls on the other data type, wchar_t, which is a platform-dependent wide character. On Windows it spans 2 bytes and holds UTF-16 code units, while on Linux it is 4 bytes (UTF-32). While researching this, the advice I found almost everywhere was to always use wchar_t (or the corresponding container std::wstring) on Windows, but never on Linux. Since I don't need many special characters, I settled on using the char type internally and converting incoming std::wstring values to my own extended ASCII code page. For convenience, I kept the first 128 characters (0 to 127) identical to ASCII.

What does my code do?

The function unicode_to_xascii takes a std::wstring and silently removes all characters that are not in my defined code page. Supported characters with a code point above 255 are converted to their respective XASCII values.
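As a minimal sketch of the mapping described above, the conversion of a single code point could look like the following. The function name to_xascii and the -1 sentinel for unsupported input are illustrative assumptions and not part of the posted code; the code-page layout is derived from the XASCII constants in the post.

```cpp
#include <cstdint>

// Hypothetical helper (not in the posted code): maps one Unicode code point
// to the custom code page. Returns -1 for unsupported code points.
int to_xascii(std::uint32_t cp)
{
    // Layout derived from the XASCII constants in the post.
    constexpr int upper_begin = 0xC0;             // Α..Ω   -> 0xC0..0xD8
    constexpr int lower_begin = upper_begin + 25; // α..ω   -> 0xD9..0xF1
    constexpr int ops_begin   = lower_begin + 25; // ⊕ ⊖ ⊗ ⨯ -> 0xF2..0xF5

    if (cp <= 0x7F) {
        return static_cast<int>(cp);              // plain ASCII is kept as-is
    }
    if (cp >= 0x0391 && cp <= 0x03A9) {
        return upper_begin + static_cast<int>(cp - 0x0391);
    }
    if (cp >= 0x03B1 && cp <= 0x03C9) {
        return lower_begin + static_cast<int>(cp - 0x03B1);
    }
    switch (cp) {
        case 0x2295: return ops_begin;            // circled plus
        case 0x2296: return ops_begin + 1;        // circled minus
        case 0x2297: return ops_begin + 2;        // circled times
        case 0x2A2F: return ops_begin + 3;        // cross product
        default:     return -1;                   // not in the code page
    }
}
```

Returning an int rather than a char sidesteps the question of whether char is signed on the target platform; the caller can narrow the result once it has checked for -1.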

Concerns/Questions:

Please comment on these points in addition to anything else you notice in the code.

Obviously, there's extra overhead for the conversion, but since all other operations in between can work on 1 byte instead of 2 or even 4 bytes, I think this is a valid trade-off.
I tried to avoid magic numbers and obscure bit arithmetic. Please let me know if there's something that could be made clearer.
Now that I think about it, I could probably have worked something out via operator overloading. Is there a better approach than this functional one?

#pragma once
#ifdef __WIN32__
    #include <io.h>
    #include <fcntl.h>
#endif
#include <algorithm>
#include <array>
#include <clocale>  // std::setlocale
#include <cstdint>  // uint16_t, uint32_t
#include <iostream>
#include <regex>
#include <string>
namespace Utility
{
    void setup_unicode()
    {
        std::setlocale(LC_ALL, "en_US.UTF-8");
        #ifdef __WIN32__
        _setmode(_fileno(stdout), _O_U16TEXT);
        _setmode(_fileno(stdin), _O_U16TEXT);
        #endif
    }
    namespace Unicode
    {
        constexpr uint16_t ASCII_END = 127;
        constexpr uint16_t ALPHA = 913;
        constexpr uint16_t OMEGA = 937;
        constexpr uint16_t alpha = 945;
        constexpr uint16_t omega = 969;
        constexpr uint16_t circled_plus = 8853;
        constexpr uint16_t circled_minus = 8854;
        constexpr uint16_t circled_times = 8855;
        constexpr uint16_t cross_product = 10799;
        constexpr std::array<uint16_t, 4> math_operators = {circled_plus, circled_minus, circled_times, cross_product};
    }
    namespace XASCII
    {
        constexpr char BEGIN = '\xc0';
        constexpr char ALPHA = BEGIN;
        constexpr char OMEGA = ALPHA + Unicode::OMEGA - Unicode::ALPHA;
        constexpr char alpha = OMEGA + 1;
        constexpr char omega = alpha + Unicode::omega - Unicode::alpha;
        constexpr char circled_plus = omega + 1;
        constexpr char circled_minus = circled_plus + 1;
        constexpr char circled_times = circled_minus + 1;
        constexpr char cross_product = circled_times + 1;
        constexpr std::array<char, 4> math_operators = {circled_plus, circled_minus, circled_times, cross_product};
        constexpr char IGNORE = -1;
        constexpr char REGEX_ALPHA_OMEGA[] = {XASCII::ALPHA, '-', XASCII::OMEGA};
        constexpr char REGEX_ALPHA_omega[] = {XASCII::ALPHA, '-', XASCII::omega};
    }
    bool is_utf16_carry_mark_set(uint16_t i)
    {
        return i & ((1u << 15u) + (1u << 14u));
    }
    uint16_t to_int(wchar_t w)
    {
        #ifdef __unix__
        auto *p = reinterpret_cast<uint32_t *>(&w);
        if( *p >= 1u << 16u ){
            *p = static_cast<unsigned char>(XASCII::IGNORE);
        }
        return static_cast<uint16_t>(*p);
        #endif
        #ifdef __WIN32__
        auto *p = reinterpret_cast<uint16_t*>(&w);
        if( is_utf16_carry_mark_set(*p) )
        {
            *p = XASCII::IGNORE;
        }
        return static_cast<uint16_t>(*p);
        #endif
    }
    std::string unicode_to_xascii(const std::wstring &wstr)
    {
        std::string result;
        unsigned p = 0, len = wstr.length();
        result.resize(len);
        for( unsigned k = 0; k < len; ++k ){
            uint16_t character = to_int(wstr[k]);
            if( character == static_cast<uint16_t>(XASCII::IGNORE)){
                continue;
            }
            if( character <= Unicode::ASCII_END ){
                result[p++] = static_cast<unsigned char>(character);
                continue;
            }
            if( Unicode::ALPHA <= character && character <= Unicode::OMEGA ){
                result[p++] = static_cast<unsigned char>(XASCII::ALPHA + (character - Unicode::ALPHA));
            }
            else if( Unicode::alpha <= character && character <= Unicode::omega ){
                result[p++] = static_cast<unsigned char>(XASCII::alpha + character - Unicode::alpha);
                continue;
            }
            auto index = std::find(Unicode::math_operators.cbegin(), Unicode::math_operators.cend(), character);
            if( index != Unicode::math_operators.cend()){
                result[p++] = XASCII::math_operators[index - Unicode::math_operators.cbegin()];
            }
        }
        result.resize(p);
        return result;
    }
    std::wstring xascii_to_unicode(const std::string &str)
    {
        std::wstring result;
        unsigned p = 0, len = str.length();
        result.resize(len);
        for( unsigned k = 0; k < len; ++k ){
            char character = str[k];
            if( character == XASCII::IGNORE ){
                continue;
            }
            if( character < XASCII::BEGIN ){
                result[p++] = static_cast<wchar_t>(character);
            }
            else if( XASCII::ALPHA <= character && character <= XASCII::OMEGA ){
                result[p++] = static_cast<wchar_t>(Unicode::ALPHA + (character - XASCII::ALPHA));
            }
            else if( XASCII::alpha <= character && character <= XASCII::omega ){
                result[p++] = static_cast<wchar_t>(Unicode::alpha + (character - XASCII::alpha));
            }
            else{
                auto index = std::find(XASCII::math_operators.cbegin(), XASCII::math_operators.cend(), character);
                if( index != XASCII::math_operators.cend()){
                    result[p++] = static_cast<wchar_t>(Unicode::math_operators[index - XASCII::math_operators.cbegin()]);
                }
            }
        }
        result.resize(p);
        return result;
    }
}

Answer

Correct me if I'm wrong, but aren't functions and overloaded operators basically the same thing: a glorified goto statement? I would only use operator overloading if it makes the code easier to read when you come back to it in 3 months.

I would change all the 1u << x expressions to their hex values. Those carry marks will not change any time soon, so you could make them constants and not rely on that function.
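A minimal sketch of the hex-constant idea follows; the names SURROGATE_MASK, SURROGATE_VALUE and is_utf16_surrogate are illustrative and do not appear in the posted code. Note that the posted mask, (1u << 15u) + (1u << 14u), equals 0xC000 and so matches every code unit from 0x4000 upwards, whereas UTF-16 surrogates occupy only 0xD800 to 0xDFFF. The sketch assumes the intent was to detect surrogates.

```cpp
#include <cstdint>

// Illustrative constants (not in the posted code).
constexpr std::uint16_t SURROGATE_MASK  = 0xF800; // keep the top five bits
constexpr std::uint16_t SURROGATE_VALUE = 0xD800; // bit pattern 1101 1xxx xxxx xxxx

// True if the UTF-16 code unit is one half of a surrogate pair,
// i.e. lies in the range 0xD800..0xDFFF.
constexpr bool is_utf16_surrogate(std::uint16_t unit)
{
    return (unit & SURROGATE_MASK) == SURROGATE_VALUE;
}
```

With this check, the cross product operator (10799, or 0x2A2F) is not mistaken for part of a surrogate pair.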

I think you forgot a continue in the ALPHA-to-OMEGA branch of the for loop in unicode_to_xascii.
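One way to make a forgotten continue impossible is a single if / else-if chain, so that exactly one branch runs per character. The following is a simplified sketch under assumptions: the function name convert and the std::vector input are illustrative, the math operators are omitted for brevity, and the numeric values are copied from the constants in the question.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative sketch (not the posted function): converts 16-bit code points
// to the custom code page. Unsupported values are dropped.
std::string convert(const std::vector<std::uint16_t> &input)
{
    std::string out;
    out.reserve(input.size());
    for (std::uint16_t c : input) {
        if (c <= 127) {                           // ASCII is kept as-is
            out.push_back(static_cast<char>(c));
        } else if (c >= 913 && c <= 937) {        // Α..Ω -> 0xC0..0xD8
            out.push_back(static_cast<char>(0xC0 + (c - 913)));
        } else if (c >= 945 && c <= 969) {        // α..ω -> 0xD9..0xF1
            out.push_back(static_cast<char>(0xD9 + (c - 945)));
        }
        // Anything else falls through and is skipped in this sketch.
    }
    return out;
}
```

Using push_back on a reserved string also removes the separate write index p and the final resize.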

Are you ever worried about a string with more characters than an unsigned can count (unsigned k = 0; k < len; ++k)? Are you restricting the size of the string elsewhere?
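If the length is not restricted elsewhere, indexing with the string's own size type avoids the narrowing from std::size_t to unsigned. A small sketch, where count_ascii is a hypothetical function used only to show the loop shape:

```cpp
#include <cstddef>
#include <string>

// Illustrative: counts the ASCII characters in a wide string, indexing with
// the string's own size type so the counter matches wstr.size() exactly.
std::size_t count_ascii(const std::wstring &wstr)
{
    std::size_t n = 0;
    for (std::wstring::size_type k = 0; k < wstr.size(); ++k) {
        // Cast to an unsigned type first; wchar_t may be signed.
        if (static_cast<unsigned long>(wstr[k]) <= 127) {
            ++n;
        }
    }
    return n;
}
```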

Other than that, it looks okay.

David Fisher · Accepted Answer · 2020-06-17 23:42:48Z
