Appending a code point to a UTF-8 std::string using icu4c

Question 1

My code is

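#include <string>
#include <unicode/utf8.h>

// Appends the UTF-8 encoding of cp to str; cp must be a valid code point.
void utf8_append(UChar32 cp, std::string& str) {
    size_t offset = str.size();
    str.resize(offset + U8_LENGTH(cp));
    auto ptr = reinterpret_cast<uint8_t*>(&str[0]);
    U8_APPEND_UNSAFE(ptr, offset, static_cast<uint32_t>(cp));
}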

This works but seems ugly. Maybe I am overlooking a simpler approach?

Relevant documentation: https://unicode-org.github.io/icu/userguide/strings/utf-8.html and https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/utf8_8h.html.

Answer

Beauty is in the eye of the beholder. I say it is perfectly valid and correct code! The only thing you might get rid of is the static_cast<uint32_t>, since a UChar32, which is an alias for int32_t, will implicitly convert to uint32_t without warnings. You could also use append() instead of resize(), which avoids the addition, and remove the temporary ptr, to finally get:

void utf8_append(UChar32 cp, std::string& str) {
    auto offset = str.size();
    str.append(U8_LENGTH(cp), {});
    U8_APPEND_UNSAFE(reinterpret_cast<uint8_t *>(&str[0]), offset, cp);
}

If you can use C++17, str.data() is slightly nicer than &str[0] in my opinion. Or you could write &str.front().
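
One thing worth keeping in mind with either version: the ICU documentation describes U8_APPEND_UNSAFE as assuming a valid code point and enough space in the buffer, and U8_LENGTH as returning 0 for surrogates and out-of-range values. To make the grow-then-fill pattern and that validity assumption concrete, here is a rough standalone sketch in plain C++ that does not use ICU at all. The names utf8_length and utf8_append_plain are made up for illustration and are not ICU functions; treat this as an approximation of what the macros compute, not as ICU's actual implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not an ICU function): number of UTF-8 bytes needed
// for cp, or 0 if cp is a surrogate or lies above U+10FFFF.
inline std::size_t utf8_length(std::uint32_t cp) {
    if (cp <= 0x7F) return 1;
    if (cp <= 0x7FF) return 2;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  // surrogates are not encodable
    if (cp <= 0xFFFF) return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;                                    // beyond the Unicode range
}

// Hypothetical helper: append cp to str as UTF-8. Returns false and leaves
// str unchanged when cp is not a valid code point for encoding.
inline bool utf8_append_plain(std::uint32_t cp, std::string& str) {
    const std::size_t len = utf8_length(cp);
    if (len == 0) return false;

    std::size_t i = str.size();
    str.append(len, '\0');  // grow first, then fill the new bytes in place

    switch (len) {
    case 1:
        str[i] = static_cast<char>(cp);
        break;
    case 2:
        str[i++] = static_cast<char>(0xC0 | (cp >> 6));
        str[i]   = static_cast<char>(0x80 | (cp & 0x3F));
        break;
    case 3:
        str[i++] = static_cast<char>(0xE0 | (cp >> 12));
        str[i++] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        str[i]   = static_cast<char>(0x80 | (cp & 0x3F));
        break;
    default:  // len == 4
        str[i++] = static_cast<char>(0xF0 | (cp >> 18));
        str[i++] = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        str[i++] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        str[i]   = static_cast<char>(0x80 | (cp & 0x3F));
        break;
    }
    return true;
}
```

Because the length check happens before the string grows, an invalid value never leads to a write past the end of the buffer. With the ICU macros, a similar guard would be to test the code point (or the result of U8_LENGTH) before calling U8_APPEND_UNSAFE.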

Comment

"without warnings" Not with this project's warning settings. And no C++17, unfortunately.

Reply

Ah ok. Well if they are that strict then you're stuck with the static_cast of course. You could consider using std::basic_string<uint8_t> to get rid of both casts, but it will probably open up a can of worms elsewhere in your codebase.

G. Sliepen · Accepted Answer · 2020-10-21 18:19:32Z
