Truncating Incomplete UTF-8 character
I created a function that truncates an incomplete UTF-8 character at the end of a std::string in C++.
The C++ standard library does not yet support character-based substr on UTF-8 strings; substr works by number of bytes only.
Because of that, in the example below, substr leaves a broken character at the end of the output.
std::string utfstr = "옷三옷白옷옷-어<어<어<어<-";
std::cout << utfstr.substr(0, 5) << std::endl;
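To make the byte arithmetic concrete: each of the first two characters in that string takes three bytes in UTF-8, so substr(0, 5) keeps the first character plus only two of the three bytes of the second. The following is a minimal sketch of how the length of a sequence can be read from its lead byte. The helper names Utf8SequenceLength and CompleteCharsInPrefix are hypothetical, made up for this illustration, and the second helper assumes the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes in a UTF-8 sequence, judged from its
// lead byte. Returns 0 for a continuation byte or an invalid lead byte.
std::size_t Utf8SequenceLength(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx or invalid
}

// Hypothetical helper: how many complete characters fit in the first
// maxBytes bytes of s. Assumes s is well-formed UTF-8.
std::size_t CompleteCharsInPrefix(const std::string& s, std::size_t maxBytes) {
    std::size_t pos = 0;
    std::size_t count = 0;
    while (pos < s.size()) {
        const std::size_t len =
            Utf8SequenceLength(static_cast<unsigned char>(s[pos]));
        // Stop at an invalid byte or at a character that crosses the limit.
        if (len == 0 || pos + len > maxBytes || pos + len > s.size()) break;
        pos += len;
        ++count;
    }
    return count;
}
```

With a 5-byte limit and two 3-byte characters, CompleteCharsInPrefix reports one complete character; the remaining two bytes are the fragment that shows up as a broken character.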
It seems like my function is working OK, but I would like to get some feedback on possible problems and improvements with the code.
Here is my code:
#include <string>
#include <iostream>
using namespace std;
// Removes an incomplete UTF-8 sequence from the end of str, if there is one.
// Returns the number of bytes removed (0 if nothing was removed).
std::size_t TrimEndUTF8(std::string& str) {
    // A UTF-8 sequence is at most 4 bytes long, and we must never read
    // before the start of the string (this also covers the empty string).
    const std::size_t maxScan = str.size() < 4 ? str.size() : 4;

    // Scan backward from the end of the string; num counts the bytes seen.
    for (std::size_t num = 1; num <= maxScan; ++num) {
        const unsigned char c =
            static_cast<unsigned char>(str[str.size() - num]);

        if ((c & 0x80) == 0x00) {
            // 0xxxxxxx: plain ASCII byte, nothing to truncate.
            return 0;
        }
        if ((c & 0xC0) == 0x80) {
            // 10xxxxxx: continuation byte, keep looking for the lead byte.
            continue;
        }

        // Lead byte: work out how many bytes the sequence should have.
        std::size_t expected = 0;                   // 0 = invalid lead byte
        if ((c & 0xE0) == 0xC0) expected = 2;       // 110xxxxx
        else if ((c & 0xF0) == 0xE0) expected = 3;  // 1110xxxx
        else if ((c & 0xF8) == 0xF0) expected = 4;  // 11110xxx

        if (num == expected) {
            return 0;  // The last character is complete.
        }
        str.resize(str.size() - num);
        return num;
    }
    // No lead byte within reach: the tail is malformed rather than
    // truncated, so leave the string unchanged.
    return 0;
}
int main() {
    for (int i = 1; 30 > i; ++i) {
        std::string utfStr = "안-녕<하>세d요e만f나g서반갑습니다";
        std::string substred = utfStr.substr(0, i);
        size_t trimmed = TrimEndUTF8(substred);
        cout << "Trimmed " << trimmed << " bytes" << endl;
        cout << substred << endl;
    }
    for (int i = 1; 30 > i; ++i) {
        std::string utfStr = "𠜎_𠜱_𠝹_𠱓_𠱸_𠲖𠳏𠳕𠴕𠵼-𠵿-𠸎-𠸏-𠹷-𠺝-𠺢𠻗";
        std::string substred = utfStr.substr(0, i);
        size_t trimmed = TrimEndUTF8(substred);
        cout << "Trimmed " << trimmed << " bytes" << endl;
        cout << substred << endl;
    }
    return 0;
}
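For comparison, here is a rough sketch of a different approach to the same problem: instead of cutting with substr and repairing the result afterwards, choose a cut position that already falls on a character boundary. The function name Utf8SafePrefixLength is hypothetical, and the sketch assumes the input is well-formed UTF-8; it is not a validator.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the largest length <= maxBytes at which s can
// be cut without splitting a UTF-8 sequence. Assumes s is well-formed UTF-8.
std::size_t Utf8SafePrefixLength(const std::string& s, std::size_t maxBytes) {
    if (maxBytes >= s.size()) {
        return s.size();  // The whole string fits, nothing to cut.
    }
    std::size_t pos = maxBytes;
    // If the first excluded byte is a continuation byte (10xxxxxx), the cut
    // would land inside a character, so back up to that character's lead byte.
    while (pos > 0 &&
           (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80) {
        --pos;
    }
    return pos;
}
```

A caller would then write something like str.substr(0, Utf8SafePrefixLength(str, n)), which avoids creating the broken tail in the first place. The trade-off is that it only inspects the bytes around the cut point and trusts the rest of the string.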