improved formatting; grammar; code snippets

edited Jul 23, 2015 at 9:37

asked Jul 23, 2015 at 8:42

LETs

Truncating an incomplete UTF-8 character

I created a function that truncates an incomplete UTF-8 character at the end of a std::string in C++.

The C++ Standard Library does not yet support a character-based substr for UTF-8 text; substr counts bytes only.

Because of that, in the example below, substr cuts a multi-byte character in half and a broken character appears at the end of the output.

std::string utfstr = "옷三옷白옷옷-어<어<어<어<-";
std::cout << utfstr.substr(0, 5) << std::endl;
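
To make the byte/character mismatch concrete, here is a minimal sketch that counts characters by skipping UTF-8 continuation bytes (bit pattern 10xxxxxx). The name CountCodePoints is a hypothetical helper, not something the Standard Library provides, and the sketch assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points by counting every byte that
// is not a continuation byte (10xxxxxx). Assumes the input is valid UTF-8.
std::size_t CountCodePoints(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // ASCII byte or lead byte: a new character starts here
        }
    }
    return count;
}
```

The first two characters of the string above, written out as the bytes "\xEC\x98\xB7\xE4\xB8\x89", have a size() of 6 while CountCodePoints reports 2, so substr(0, 5) has to end inside the second character.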

My function seems to work, but I would like feedback on possible problems and improvements.

Here is my code:

#include <cstddef>
#include <iostream>
#include <string>
using namespace std;

// Removes an incomplete UTF-8 sequence from the end of str.
// Returns the number of bytes removed (0 if the string already ends cleanly).
std::size_t TrimEndUTF8(std::string& str) {
    if (str.empty()) { return 0; }  // str.back() is undefined on an empty string
    // Scan backward from the end of the string.
    const char* cptr = &str.back();
    int num = 1;  // bytes seen so far in the trailing sequence
    std::size_t numBytesToTruncate = 0;
    for (int i = 0; i < 6; ++i) {
        numBytesToTruncate += 1;
        if ((*cptr & 0x80) == 0x80) {  // 1xxxxxxx: part of a multi-byte sequence
            // Look for the lead byte of the sequence.

            //if ((*cptr & 0xFC) == 0xFC) { if (num == 6) { return 0; } break; }
            //if ((*cptr & 0xF8) == 0xF8) { if (num == 5) { return 0; } break; }

            // 1111xxxx: lead byte of a 4-byte sequence.
            if ((*cptr & 0xF0) == 0xF0) { if (num == 4) { return 0; } break; }
            // 111xxxxx: lead byte of a 3-byte sequence.
            if ((*cptr & 0xE0) == 0xE0) { if (num == 3) { return 0; } break; }
            // 11xxxxxx: lead byte of a 2-byte sequence.
            if ((*cptr & 0xC0) == 0xC0) { if (num == 2) { return 0; } break; }
            num += 1;  // 10xxxxxx: continuation byte, keep scanning
        } else {
            // 0xxxxxxx: plain ASCII, nothing to truncate.
            return 0;
        }
        if (cptr == str.data()) { break; }  // do not read before the first byte
        cptr -= 1;
    }
    str.resize(str.length() - numBytesToTruncate);
    return numBytesToTruncate;
}
int main() {
    // Cut a string of 3-byte characters at every byte length from 1 to 29.
    for (int i = 1; i < 30; ++i) {
        std::string utfStr = "안-녕<하>세d요e만f나g서반갑습니다";
        std::string substred = utfStr.substr(0, i);
        size_t trimmed = TrimEndUTF8(substred);
        std::cout << "Trimmed " << trimmed << " bytes" << std::endl;
        std::cout << substred << std::endl;
    }
    // Same test with 4-byte characters.
    for (int i = 1; i < 30; ++i) {
        std::string utfStr = "𠜎_𠜱_𠝹_𠱓_𠱸_𠲖𠳏𠳕𠴕𠵼-𠵿-𠸎-𠸏-𠹷-𠺝-𠺢𠻗";
        std::string substred = utfStr.substr(0, i);
        size_t trimmed = TrimEndUTF8(substred);
        std::cout << "Trimmed " << trimmed << " bytes" << std::endl;
        std::cout << substred << std::endl;
    }
    return 0;
}
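
The loops in main print each result for inspection by eye. As a rough sketch of how the same check might be automated, the forward scan below reports whether a string ends exactly on a character boundary. SequenceLength and EndsOnCompleteCharacter are hypothetical names introduced only for this illustration. The sketch looks at lead bytes only and does not validate the continuation bytes, so it is not a full UTF-8 validator.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: expected length of a UTF-8 sequence judged by its
// lead byte, or 0 if the byte cannot start a sequence (for example a
// continuation byte 10xxxxxx).
std::size_t SequenceLength(unsigned char lead) {
    if (lead < 0x80) { return 1; }            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) { return 2; }  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) { return 3; }  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) { return 4; }  // 11110xxx
    return 0;
}

// Hypothetical check: walks the string from the front, one sequence at a
// time, and returns true only if the walk lands exactly on the end.
// Continuation bytes are skipped over, not validated.
bool EndsOnCompleteCharacter(const std::string& s) {
    std::size_t offset = 0;
    while (offset < s.size()) {
        const std::size_t len =
            SequenceLength(static_cast<unsigned char>(s[offset]));
        if (len == 0 || len > s.size() - offset) {
            return false;  // bad lead byte, or the last sequence is cut short
        }
        offset += len;
    }
    return true;
}
```

With something like this in place, each loop iteration could check EndsOnCompleteCharacter(substred) after the call to TrimEndUTF8 instead of relying on the printed output.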

c++ utf-8
