UTF8 Codepoint decode and length

Question 1

I needed a function that would:

Decode and return the first character of a UTF-8 encoded string
Return the length of the encoding, with the special case that the length of '\0' must be 0
Performance is important

I had no special requirement on what to do with invalid sequences so I opted for the following behaviour:

The first byte of an invalid sequence is treated as a single "character" (e.g. for the sequence "\xFF\x2F" it would return '\xFF' as the value and 1 as the length).
Overlong encodings are accepted.

I wrote the following function:

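    #include <stdint.h>

    /* LEN: number of continuation bytes, indexed by bits 3..5 of a lead byte.
       MSK: number of high bits to drop from the lead byte, indexed by length. */
    static uint8_t LEN[] = {1,1,1,1,2,2,3,0};
    static uint8_t MSK[] = {0,0,3,4,5,0,0,0};
    static int utf8_cp(char *txt, int32_t *ch)
    {
        int len = 0;
        int32_t val = 0;
        uint8_t first = (uint8_t)(*txt);
        /* 0 for NUL, 1 for ASCII or stray bytes, 2..4 for lead bytes */
        len = (first > 0) * (1 + ((first & 0xC0) == 0xC0) * LEN[(first >> 3) & 7]);
        val = first & (0xFF >> MSK[len]);
        for (int k=len; k>1; k--) {
            if ((*++txt & 0xC0) != 0x80) {
                /* not a continuation byte: fall back to a 1-byte "character" */
                val = first;
                len = 1;
                break;
            }
            val = (val << 6) | (*txt & 0x3F);
        }
        *ch = val;
        return len;
    }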

So that code like this:

    char *t; int l; int32_t c;
    t = "aàも𫀔";
    while(1) {
        l = utf8_cp(t, &c);
        printf("'%s' len:%d cp:0x%05x\n", t, l, c);
        if (*t == 0) break;
        t += l;
    }

Produces:

    'aàも𫀔' len:1 cp:0x00061
    'àも𫀔' len:2 cp:0x000e0
    'も𫀔' len:3 cp:0x03082
    '𫀔' len:4 cp:0x2b014
    '' len:0 cp:0x00000

To make it faster, I thought about unrolling the for loop (though I wonder how much I would gain) and adding an if at the beginning to handle ASCII characters (though I fear the branch could cost more than just performing a handful of operations).

I would appreciate any comments or suggestions for improvement.

Question 2

To accommodate some of the comments in the review, I submitted a different version of the function above: codereview.stackexchange.com/questions/142323/…

Question 3

Rather than using narrow types, use the fastest ones.

    // uint8_t first = (uint8_t)(*txt);
    unsigned first = (uint8_t)(*txt);
    // or
    uint_fast8_t first = (uint8_t)(*txt);

Rather than looking up a value to shift by, look up the already shifted value.

    // static uint8_t MSK[] = {0,0,3,4,5,0,0,0};
    // val = first & (0xFF >> MSK[len]);
    static const uint8_t FF_MSK[] = {0xFF >>0, 0xFF >>0, 0xFF >>3,
        0xFF >>4, 0xFF >>5, 0xFF >>0, 0xFF >>0, 0xFF >>0};
    val = first & FF_MSK[len];

Some modern compilers can make additional optimizations if the pointers are known not to overlap: use restrict and const where applicable.

    // int utf8_cp(char *txt, int32_t *ch)
    int utf8_cp(const char * restrict txt, int32_t *restrict ch)

Coding the companion encoding function would aid in testing both functions.

    int utf8_cp_encode(int32_t *ch, char *txt);
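As an illustration of what such a companion could look like, here is a minimal sketch. The name `utf8_encode_sketch` and its signature (code point passed by value, byte count returned) are assumptions made for this example and differ from the prototype above. It rejects surrogates and values above 0x10FFFF, and it encodes U+0000 as one byte, whereas the decoder in the question reports length 0 for NUL.

```c
#include <stdint.h>

/* Hypothetical companion encoder (name and signature are assumptions).
   Writes the UTF-8 encoding of cp to out, which must have room for
   4 bytes, and returns the number of bytes written, or 0 if cp is not
   a Unicode scalar value. */
static int utf8_encode_sketch(int32_t cp, char *out)
{
    if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                               /* out of range or surrogate */
    if (cp < 0x80) {                            /* 0xxxxxxx */
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {                           /* 110xxxxx 10xxxxxx */
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                         /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Feeding the bytes it produces back into the decoder gives a simple consistency check for both directions.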

As the code does not detect invalid encodings such as surrogates, redundant (overlong) patterns, and values above the Unicode maximum (0x10FFFF), I see little value in handling only a subset of invalid sequences. Either detect them all (maybe in debug mode) or skip detection.
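One way to "detect them all" without touching the fast path is a separate check applied to the decoder's result, which could be compiled in only for debug builds. The sketch below is an assumption about how that might look; `utf8_cp_is_strict` is a hypothetical name, and it takes the code point and length that a lenient decoder reported.

```c
#include <stdint.h>

/* Hypothetical post-decode check (not part of the original code).
   Returns 1 if a code point cp decoded from len bytes is well-formed
   UTF-8, 0 otherwise. */
static int utf8_cp_is_strict(int32_t cp, int len)
{
    /* Smallest and largest code point that a len-byte sequence may carry. */
    static const int32_t min_cp[5] = {0, 0x00, 0x80,  0x800,  0x10000};
    static const int32_t max_cp[5] = {0, 0x7F, 0x7FF, 0xFFFF, 0x10FFFF};

    if (len < 1 || len > 4)
        return 0;
    if (cp < min_cp[len])              /* overlong ("redundant") encoding */
        return 0;
    if (cp > max_cp[len])              /* stray byte, or above U+10FFFF */
        return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF)  /* UTF-16 surrogate */
        return 0;
    return 1;
}
```

With the decoder from the question, every fallback to a 1-byte "character" yields a value of 0x80 or more with length 1, so this check would flag those cases too. Length 0 (end of string) is reported as not valid here and would need to be handled by the caller.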

Suggest doing a 32-byte (or 256-byte) table lookup for performance. Profile to find the optimal size.

    // len = (first > 0) * (1 + ((first & 0xC0) == 0xC0) * LEN[(first >> 3) & 7]);
    len = (first > 0) * LEN_32[first >> 3];
    // or
    len = LEN_256[first];

This could be extended to do a single lookup for both len and val.
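A minimal sketch of such a combined lookup, assuming a 32-entry table indexed by `first >> 3`. The names `struct utf8_lead`, `LEAD_32` and `utf8_lead_info` are hypothetical, and the entries for stray bytes keep the question's behaviour of returning the byte itself with length 1.

```c
#include <stdint.h>

/* Hypothetical combined table: sequence length and lead-byte payload mask. */
struct utf8_lead { uint8_t len; uint8_t mask; };

static const struct utf8_lead LEAD_32[32] = {
    /* 0x00..0x7F: ASCII */
    {1,0x7F},{1,0x7F},{1,0x7F},{1,0x7F},{1,0x7F},{1,0x7F},{1,0x7F},{1,0x7F},
    {1,0x7F},{1,0x7F},{1,0x7F},{1,0x7F},{1,0x7F},{1,0x7F},{1,0x7F},{1,0x7F},
    /* 0x80..0xBF: stray continuation bytes, returned as-is */
    {1,0xFF},{1,0xFF},{1,0xFF},{1,0xFF},{1,0xFF},{1,0xFF},{1,0xFF},{1,0xFF},
    /* 0xC0..0xDF: 2-byte lead */
    {2,0x1F},{2,0x1F},{2,0x1F},{2,0x1F},
    /* 0xE0..0xEF: 3-byte lead */
    {3,0x0F},{3,0x0F},
    /* 0xF0..0xF7: 4-byte lead */
    {4,0x07},
    /* 0xF8..0xFF: invalid lead, returned as-is */
    {1,0xFF}
};

/* Returns the expected sequence length (0 for NUL) and stores the payload
   bits of the lead byte in *val, using a single table access. */
static int utf8_lead_info(unsigned char first, int32_t *val)
{
    const struct utf8_lead *e = &LEAD_32[first >> 3];
    *val = first & e->mask;
    return (first > 0) * e->len;
}
```

Whether this beats the two small tables depends on the target and cache behaviour, so it would need profiling as noted above.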

Question 4

Thank you, chux. I tried to do the bare minimum to treat each invalid encoding as a sequence of 1-byte "characters". I could change the test if ((*++txt & 0xC0) != 0x80) to if (*++txt), but this would mean that an invalid sequence would eat up more characters. Is there any portion of the code I should look at?

Question 5

@Remo.D Note: Your code treats some invalid encodings as a sequence of 1-byte characters, but not all. "Is there any portion of code I should look at?" Finding where the code is wrong takes work and needs tests. Step 1: Run utf8_cp() over all valid Unicode values, [0-0xD7FF] and [0xE000-0x10FFFF], and check that they round-trip back to their original value, noting the intermediate 1-4 byte UTF-8 encoding. Step 2: Use those UTF-8 encodings to mark off a flag array indexed by 32-bit value, where a short sequence such as "A" (x41) marks off every x41 xx xx xx position. The remaining unmarked values should be detected by your code as invalid.
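Step 1 could be sketched roughly as follows. The decoder is copied from the question; `encode_ref` and `roundtrip_failures` are hypothetical helpers written for this example. U+0000 is skipped because the decoder defines its length as 0. Step 2 is not covered here.

```c
#include <stdint.h>

/* Decoder under test, copied from the question. */
static uint8_t LEN[] = {1,1,1,1,2,2,3,0};
static uint8_t MSK[] = {0,0,3,4,5,0,0,0};
static int utf8_cp(char *txt, int32_t *ch)
{
    int len = 0;
    int32_t val = 0;
    uint8_t first = (uint8_t)(*txt);
    len = (first > 0) * (1 + ((first & 0xC0) == 0xC0) * LEN[(first >> 3) & 7]);
    val = first & (0xFF >> MSK[len]);
    for (int k = len; k > 1; k--) {
        if ((*++txt & 0xC0) != 0x80) {
            val = first;
            len = 1;
            break;
        }
        val = (val << 6) | (*txt & 0x3F);
    }
    *ch = val;
    return len;
}

/* Hypothetical reference encoder for valid scalar values only.
   Writes a NUL-terminated encoding (out needs 5 bytes); returns its length. */
static int encode_ref(int32_t cp, char *out)
{
    static const unsigned char lead[5] = {0, 0x00, 0xC0, 0xE0, 0xF0};
    int len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    for (int k = len - 1; k > 0; k--) {     /* continuation bytes, last first */
        out[k] = (char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    out[0] = (char)(lead[len] | cp);
    out[len] = '\0';
    return len;
}

/* Encodes every scalar value except U+0000, decodes it again, and
   returns the number of code points that did not round-trip. */
static long roundtrip_failures(void)
{
    long bad = 0;
    char buf[5];
    for (int32_t cp = 1; cp <= 0x10FFFF; cp++) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            continue;                        /* surrogates are not valid */
        int32_t got;
        int n = encode_ref(cp, buf);
        if (utf8_cp(buf, &got) != n || got != cp)
            bad++;
    }
    return bad;
}
```

A test harness would call `roundtrip_failures()` and treat any non-zero result as a decoder (or encoder) bug.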

chux · Accepted Answer · 2016-09-21 01:44:47Z

