
Added parentheses. Improved IS_UTF8_SEQUENCE_BYTE().

edited Apr 12, 2014 at 0:13


There are a lot of magic numbers. They are numbers that would be familiar to any good programmer, but magic numbers nonetheless. Furthermore, it's more useful to think of the problem using bitwise operations, since they are more closely related to the conceptual design behind UTF-8.

You only accept two-, three-, and four-byte sequences, i.e., only code points U+0080 to U+10FFFF. (The four-byte bit pattern has room for values up to U+1FFFFF, but Unicode ends at U+10FFFF.) That covers the Basic Multilingual Plane (minus the ASCII set) plus all of the supplementary planes. That should be made clear in comments.
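
To make the accepted range concrete, here is a minimal sketch that maps a code point to the number of bytes its UTF-8 encoding needs. The function name `utf8_encoded_length` is hypothetical and not part of the code under review; it follows the ranges in RFC 3629 and rejects surrogates and anything above U+10FFFF, where Unicode stops.

```c
#include <stdint.h>

/* Hypothetical helper: number of bytes UTF-8 needs for code point cp,
   or 0 if cp is not an encodable Unicode scalar value. */
static int utf8_encoded_length(uint32_t cp)
{
    if (cp <= 0x7F)     return 1;               /* ASCII: 0xxxxxxx */
    if (cp <= 0x7FF)    return 2;               /* 110xxxxx 10xxxxxx */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* UTF-16 surrogates are not valid */
    if (cp <= 0xFFFF)   return 3;               /* rest of the BMP */
    if (cp <= 0x10FFFF) return 4;               /* supplementary planes */
    return 0;                                   /* beyond Unicode */
}
```

Only the cases returning 2, 3, or 4 correspond to the multibyte sequences discussed here.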

I suggest renaming IS_UTF8_BYTE() to IS_UTF8_MULTIBYTE(), since ASCII characters ≤ 127 are valid UTF-8 bytes too — just not members of a multibyte sequence.

//check if value is in range of leading byte (0b11?????? but not 0b11111???)
#define IS_UTF8_LEADING_BYTE(b) ((((b) & 0xc0) == 0xc0) && (((b) & 0xf8) != 0xf8))
//check if value is in range of sequence byte (0b10??????): bit 7 set, bit 6 clear
#define IS_UTF8_SEQUENCE_BYTE(b) ((((b) & 0x80) != 0) && (((b) & 0x40) == 0))
//can be any byte within a UTF-8 multibyte sequence
#define IS_UTF8_MULTIBYTE(b) (IS_UTF8_LEADING_BYTE(b) || IS_UTF8_SEQUENCE_BYTE(b))
//no error checking; it must be used only on leading byte
//adds 1 for each of the prefixes 0b11, 0b111, 0b1111, giving 1, 2, or 3
#define HOW_MANY_UTF8_SEQUENCE_BYTES(b) ((((b) & 0x40) == 0x40) + (((b) & 0x60) == 0x60) + (((b) & 0x70) == 0x70))
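
As a usage sketch, macros along these lines can drive a simple scanner that counts the characters in a buffer. The macro definitions below are one working spelling of the bit tests described in the comments, written with explicit comparisons, and `utf8_count_chars` is a hypothetical function name. It checks only the bit patterns, not overlong forms or surrogates.

```c
#include <stddef.h>

/* leading byte: 0b11?????? but not 0b11111??? */
#define IS_UTF8_LEADING_BYTE(b) ((((b) & 0xc0) == 0xc0) && (((b) & 0xf8) != 0xf8))
/* sequence (continuation) byte: 0b10?????? */
#define IS_UTF8_SEQUENCE_BYTE(b) (((b) & 0xc0) == 0x80)
/* number of sequence bytes that follow a leading byte (1, 2, or 3) */
#define HOW_MANY_UTF8_SEQUENCE_BYTES(b) \
    ((((b) & 0x40) == 0x40) + (((b) & 0x60) == 0x60) + (((b) & 0x70) == 0x70))

/* Hypothetical helper: counts characters in s[0..len), or returns -1
   on a malformed or truncated sequence. */
static long utf8_count_chars(const unsigned char *s, size_t len)
{
    long count = 0;
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i++];
        if (b < 0x80) {                 /* single-byte ASCII */
            count++;
            continue;
        }
        if (!IS_UTF8_LEADING_BYTE(b))   /* stray sequence byte or 0xf8..0xff */
            return -1;
        int n = HOW_MANY_UTF8_SEQUENCE_BYTES(b);
        while (n-- > 0) {               /* consume the expected sequence bytes */
            if (i >= len || !IS_UTF8_SEQUENCE_BYTE(s[i]))
                return -1;
            i++;
        }
        count++;
    }
    return count;
}
```

Taking the bytes as `unsigned char` keeps the comparison against `0x80` well defined on platforms where plain `char` is signed.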

IS_UTF8_SEQUENCE_BYTE() could be more clever.

//check if value is in range of sequence byte (0b10??????)
#define IS_UTF8_SEQUENCE_BYTE(b) (((b) & 0xc0) == 0x80)
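
One detail worth spelling out for mask tests like this: in C, `==` binds more tightly than `&`, so `b & 0xc0 == 0x80` is read as `b & (0xc0 == 0x80)`, which is `b & 0` and therefore always false. The sketch below contrasts the two readings; both function names are hypothetical.

```c
/* How the compiler reads "b & 0xc0 == 0x80": the comparison happens
   first and yields 0, so the result is b & 0, which is always 0. */
static int seq_byte_as_parsed(unsigned char b)
{
    return b & (0xc0 == 0x80);
}

/* Mask first, then compare: yields 1 exactly for 0x80..0xbf. */
static int seq_byte_masked(unsigned char b)
{
    return (b & 0xc0) == 0x80;
}
```

The mask must therefore carry its own parentheses, separate from the ones that protect the macro argument.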


answered Apr 11, 2014 at 18:03

200_success


I suggest renaming IS_UTF8_BYTE() to IS_UTF8_MULTIBYTE(), since ASCII characters ≤ 127 are valid UTF-8 bytes too — just not members of a multibyte sequence.

//check if value is in range of leading byte (0b11?????? but not 0b11111???)
#define IS_UTF8_LEADING_BYTE(b) ((((b) & 0xc0) == 0xc0) && (((b) & 0xf8) != 0xf8))
//check if value is in range of sequence byte (0b10??????): bit 7 set, bit 6 clear
#define IS_UTF8_SEQUENCE_BYTE(b) ((((b) & 0x80) != 0) && (((b) & 0x40) == 0))
//can be any byte within a UTF-8 multibyte sequence
#define IS_UTF8_MULTIBYTE(b) (IS_UTF8_LEADING_BYTE(b) || IS_UTF8_SEQUENCE_BYTE(b))
//no error checking; it must be used only on leading byte
//adds 1 for each of the prefixes 0b11, 0b111, 0b1111, giving 1, 2, or 3
#define HOW_MANY_UTF8_SEQUENCE_BYTES(b) ((((b) & 0x40) == 0x40) + (((b) & 0x60) == 0x60) + (((b) & 0x70) == 0x70))

IS_UTF8_SEQUENCE_BYTE() could be more clever.

//check if value is in range of sequence byte (0b10??????); b must be an unsigned value in 0..255
#define IS_UTF8_SEQUENCE_BYTE(b) ((((b) ^ 0x80) >> 6) == 0)
