There are a lot of magic numbers. They are numbers that would be familiar to any good programmer, but magic numbers nonetheless. Furthermore, it's more useful to think of the problem using bitwise operations, since they are more closely related to the conceptual design behind UTF-8.
You only accept two-, three-, and four-byte sequences, i.e., only code points U+0080 to U+1FFFFF: the Basic Multilingual Plane (minus the ASCII set) plus the supplementary planes. (A four-byte sequence carries 21 payload bits, hence U+1FFFFF, although Unicode itself stops at U+10FFFF.) That should be made clear in comments.
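To make those ranges concrete, here is a minimal sketch of the mapping from code point to encoded length. The function name `utf8_encoded_length` is hypothetical, not something from your code, and the sketch deliberately mirrors the bit-level limits rather than full Unicode validity (it does not reject surrogates or values above U+10FFFF).

```c
#include <stdint.h>

/* Hypothetical helper (not from the reviewed code): number of bytes a
 * code point occupies in UTF-8, judged purely by how many payload bits
 * each sequence length can hold. Returns 0 if it does not fit in four bytes.
 * Assumption: no check for surrogates or for the U+10FFFF Unicode limit. */
static int utf8_encoded_length(uint32_t cp)
{
    if (cp <= 0x7F)     return 1;  /* 0xxxxxxx: ASCII, 7 bits */
    if (cp <= 0x7FF)    return 2;  /* 110xxxxx 10xxxxxx: 11 bits */
    if (cp <= 0xFFFF)   return 3;  /* 1110xxxx + 2 sequence bytes: 16 bits */
    if (cp <= 0x1FFFFF) return 4;  /* 11110xxx + 3 sequence bytes: 21 bits */
    return 0;                      /* would need a 5- or 6-byte form */
}
```

Something like this, or simply the comments from it, would document exactly which inputs the macros are meant to handle.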
I suggest renaming IS_UTF8_BYTE() to IS_UTF8_MULTIBYTE(), since ASCII characters ≤ 127 are valid UTF-8 bytes too, just not members of a multibyte sequence.
//check if value is in range of leading byte (0b11?????? but not 0b11111???)
#define IS_UTF8_LEADING_BYTE(b) ((((b) & 0xc0) == 0xc0) && (((b) & 0xf8) != 0xf8))
//check if value is in range of sequence byte (0b10??????): top bit set, next bit clear
#define IS_UTF8_SEQUENCE_BYTE(b) ((((b) & 0x80) != 0) && (((b) & 0x40) == 0))
//can be any byte within a UTF-8 multibyte sequence
#define IS_UTF8_MULTIBYTE(b) (IS_UTF8_LEADING_BYTE(b) || IS_UTF8_SEQUENCE_BYTE(b))
//no error checking; it must be used only on leading byte
//one sequence byte always follows; 0b111????? adds a second, 0b1111???? a third
#define HOW_MANY_UTF8_SEQUENCE_BYTES(b) (1 + (((b) & 0x20) != 0) + (((b) & 0x30) == 0x30))
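If you would rather not worry about macro parenthesization and operator precedence at all, the same tests can be written as inline functions. This is only a sketch of that alternative; all function names here are hypothetical, and it assumes the caller passes bytes as `unsigned char`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical inline-function counterparts of the macros above. */

/* Leading byte of a 2-, 3-, or 4-byte sequence:
 * 0b11?????? but not 0b11111??? */
static inline bool is_utf8_leading_byte(unsigned char b)
{
    return (b & 0xc0) == 0xc0 && (b & 0xf8) != 0xf8;
}

/* Sequence (continuation) byte: 0b10?????? */
static inline bool is_utf8_sequence_byte(unsigned char b)
{
    return (b & 0xc0) == 0x80;
}

/* Number of sequence bytes that follow a leading byte.
 * No error checking; it must be used only on a leading byte. */
static inline int utf8_sequence_bytes(unsigned char b)
{
    return 1 + ((b & 0x20) != 0) + ((b & 0x30) == 0x30);
}

/* Example use: count code points in a NUL-terminated string by
 * counting every byte that is not a sequence byte.
 * Assumption: the input is already well-formed UTF-8. */
static size_t utf8_count_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s)
        if (!is_utf8_sequence_byte((unsigned char)*s))
            ++n;
    return n;
}
```

The explicit cast to `unsigned char` at the call site matters on platforms where plain `char` is signed, because bytes ≥ 0x80 would otherwise arrive as negative values.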
IS_UTF8_SEQUENCE_BYTE() could be more clever: a single mask-and-compare does the job.
//check if value is in range of sequence byte (0b10??????)
#define IS_UTF8_SEQUENCE_BYTE(b) (((b) & 0xc0) == 0x80)
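If you want to convince yourself that the different ways of writing the sequence-byte test are interchangeable, an exhaustive check over all 256 byte values is cheap. The sketch below uses hypothetical function names and assumes the byte has already been converted to `unsigned char`; the XOR-and-shift form in particular relies on that.

```c
#include <stdbool.h>

/* Two separate bit tests: top bit set, next bit clear. */
static bool seq_byte_two_tests(unsigned char b)
{
    return (b & 0x80) != 0 && (b & 0x40) == 0;
}

/* One mask-and-compare: keep the top two bits, expect exactly 0b10. */
static bool seq_byte_mask_compare(unsigned char b)
{
    return (b & 0xc0) == 0x80;
}

/* XOR-and-shift: flip the top bit, then the top two bits must be zero. */
static bool seq_byte_xor_shift(unsigned char b)
{
    return ((b ^ 0x80) >> 6) == 0;
}

/* Returns true if all three forms give the same answer for every byte. */
static bool seq_byte_forms_agree(void)
{
    for (unsigned v = 0; v <= 0xFF; ++v) {
        unsigned char b = (unsigned char)v;
        if (seq_byte_two_tests(b) != seq_byte_mask_compare(b) ||
            seq_byte_mask_compare(b) != seq_byte_xor_shift(b))
            return false;
    }
    return true;
}
```

Of the three, the mask-and-compare reads closest to the `0b10??????` pattern in the comment, which is an argument for preferring it.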