UTF-8 Validation

Question 1

I've recently solved the "UTF-8 Validation" LeetCode problem:

A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:

For 1-byte character, the first bit is a 0, followed by its unicode code.

For n-bytes character, the first n-bits are all one's, the n+1 bit is 0, followed by n-1 bytes with most significant 2 bits being 10.

This is how the UTF-8 encoding would work:
Char. number range | UTF-8 octet sequence
 (hexadecimal) | (binary)
 --------------------+---------------------------------------------
 0000 0000-0000 007F | 0xxxxxxx
 0000 0080-0000 07FF | 110xxxxx 10xxxxxx
 0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
 0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Given an array of integers representing the data, return whether it is a valid utf-8 encoding.

Note: The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.

Example 1:

data = [197, 130, 1], which represents the octet sequence: 11000101 10000010 00000001.

Return true. It is a valid utf-8 encoding for a 2-bytes character followed by a 1-byte character.

Example 2:

data = [235, 140, 4], which represented the octet sequence: 11101011 10001100 00000100.

Return false. The first 3 bits are all one's and the 4th bit is 0 means it is a 3-bytes character. The next byte is a continuation byte which starts with 10 and that's correct. But the second continuation byte does not start with 10, so it is invalid.

The code works and was accepted by the OJ:

NUMBER_OF_BITS_PER_BLOCK = 8
MAX_NUMBER_OF_ONES = 4
class Solution(object):
 def validUtf8(self, data):
 """
 :type data: List[int]
 :rtype: bool
 """
 index = 0
 while index < len(data):
 number = data[index] & (2 ** 7)
 number >>= (NUMBER_OF_BITS_PER_BLOCK - 1)
 if number == 0: # single byte char
 index += 1
 continue
 # validate multi-byte char
 number_of_ones = 0
 while True: # get the number of significant ones
 number = data[index] & (2 ** (7 - number_of_ones))
 number >>= (NUMBER_OF_BITS_PER_BLOCK - number_of_ones - 1)
 if number == 1:
 number_of_ones += 1
 else:
 break
 if number_of_ones > MAX_NUMBER_OF_ONES:
 return False # too much ones per char sequence
 if number_of_ones == 1:
 return False # there has to be at least 2 ones
 index += 1 # move on to check the next byte in a multi-byte char sequence
 # check for out of bounds and exit early
 if index >= len(data) or index >= (index + number_of_ones - 1):
 return False 
 # every next byte has to start with "10"
 for i in range(index, index + number_of_ones - 1):
 number = data[i]
 number >>= (NUMBER_OF_BITS_PER_BLOCK - 1)
 if number != 1:
 return False
 number >>= (NUMBER_OF_BITS_PER_BLOCK - 1)
 if number != 0:
 return False
 index += 1
 return True

I've always being struggling to remember the bit manipulation tricks and tried to solve this problem without looking them up - hence, I think the code is overloaded with left and right shifts and power of two multiplications.

Am I overcomplicating the problem? What would you improve in the proposed solution? Is there a better way?

Question 2

Once you are happy with your solution, add the other requirements: 1. Every code point must be represented in the shortest possible way, for example code points ≤ 127 must not use two bytes. 2. Code points 0xd800 to 0xdfff are invalid. 3. Code points above 0x10ffff are invalid.

Question 3

I think you want to avoid the world overload because it usually means something else in the context of writing program code

Question 4

It's good idea to avoid the world overload

Question 5

This is much more complicated then if you were to use an iterator of bits. Changing a number to a list of bits, is really easy and simple, creating an iterator is also fairly simple too. This can be done by:

def to_bits(bytes):
 for byte in bytes:
 num = []
 exp = 1 << NUMBER_OF_BITS_PER_BLOCK
 while exp:
 exp >>= 1
 num.append(bool(byte & exp))
 yield num

After this you just need to do the checks in the question, the way the question is laid out. First you check if it's a correct single byte, that you done. After this you can find the amount that you need to loop through, do the checks on this number, and go on to loop through minus one items, and check if they're ok.

This can fairly easily be changed to use bitwise operators, rather than slice operators, but I find them harder to understand.

class Solution(object):
 def validUtf8(self, data):
 """
 :type data: List[int]
 :rtype: bool
 """
 bits = to_bits(data)
 for byte in bits:
 # single byte char
 if byte[0] == 0:
 continue
 # multi-byte character
 amount = sum(takewhile(bool, byte))
 if amount <= 1:
 return False
 if amount >= MAX_NUMBER_OF_ONES:
 return False
 for _ in range(amount - 1):
 try:
 byte = next(bits)
 except StopIteration:
 return False
 if byte[0:2] != [1, 0]:
 return False
 return True

Question 6

Such a powerful idea to generate lists of bits! And, the takewhile is very appropriate. Thanks so much, learned a lot of new things.

Question 7

Bugs

I'm surprised that the code was accepted by the online judge. This part of the code makes no sense to me:

# check for out of bounds and exit early
if index >= len(data) or index >= (index + number_of_ones - 1):
 return False 
# every next byte has to start with "10"
for i in range(index, index + number_of_ones - 1):
 number = data[i]
 number >>= (NUMBER_OF_BITS_PER_BLOCK - 1)
 if number != 1:
 return False
 number >>= (NUMBER_OF_BITS_PER_BLOCK - 1)
 if number != 0:
 return False
 index += 1

What does index >= (index + number_of_ones - 1) mean? It's the same as number_of_ones <= 1. That is never going happen, since you have already weeded out the # single byte char and # there has to be at least 2 ones cases earlier.

So, you are effectively checking only if index >= len(data): return False. That verifies that the data are long enough to contain one trailing byte, but it does not ensure that it is long enough to contain all of the expected trailing bytes. That is, number = data[i] could crash if data is a prematurely truncated sequence.

I'm displeased with the way you reassign number with number >>= ... here and in the # get the number of significant ones loop. Such mutation makes your code fragile and hard to understand.

This loop does not actually validate that the trailing bytes start with "10", as the comment claims. You do verify that the most-significant bit is 1, with if number != 1: return False. That's all. If you right-shift number (which is an integer ≤ 255) by 14 bits, then of course you will always get 0!

Elegance

Looping by keeping incrementing an index is awkward in Python: you have index = 0, while index < len(data), and several index += 1. Python does offer more expressive iteration techniques, and here I would recommend using an iterator.

The code would be more readable if the # get the number of significant ones loop were moved into a helper function.

I don't think that you need a special case for # single byte char. Just treat it as if you expect zero trailing bytes.

I would rename number to byte, because that's what it represents.

A docstring with a doctest would be appropriate for this function.

Suggested solution

class Solution(object):
 def validUtf8(self, data):
 """
 Check that a sequence of byte values follows the UTF-8 encoding
 rules. Does not check for canonicalization (i.e. overlong encodings
 are acceptable).
 >>> s = Solution()
 >>> s.validUtf8([197, 130, 1])
 True
 >>> s.validUtf8([235, 140, 4])
 False
 """
 data = iter(data)
 for leading_byte in data:
 leading_ones = self._count_leading_ones(leading_byte)
 if leading_ones in [1, 7, 8]:
 return False # Illegal leading byte
 for _ in range(leading_ones - 1):
 trailing_byte = next(data, None)
 if trailing_byte is None or trailing_byte >> 6 != 0b10:
 return False # Missing or illegal trailing byte
 return True
 @staticmethod
 def _count_leading_ones(byte):
 for i in range(8):
 if byte >> (7 - i) == 0b11111111 >> (7 - i) & ~1:
 return i
 return 8

Question 8

Absolutely valid points, omg, I don't know what I was thinking when writing this awkward if statement. And, yes, I also hated to redefine the number. Thank you for the very concise and Pythonic solution, definitely learned a lot from all 3 answers today.

Question 9

Consider if leading_ones in (1, 7, 8): to avoid allocating a new list for each byte in the data. Also, consider decorating _count_leading_ones with @functools.lru_cache(maxsize=256) to avoid repeated work.

Question 10

I think yes, you are over-complicating.

Bit-wise is used to squeeze every last bit of performance from a machine, but using it an interpreted language like Python kinda of defeats the whole point.

If you want to make this fast you should use C, if you want to use Python you might as well make this idiomatic and not worry about performance, so you can use this script to Oracle test the C script you write.

In short using the built-in tools of string manipulation makes this problem simple:

def valid_unicode(blocks):
 """
 A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:
 For 1-byte character, the first bit is a 0, followed by its unicode code.
 For n-bytes character, the first n-bits are all one's, the n+1 bit is 0, followed by n-1 bytes with most significant 2 bits being 10.
 Given an array of integers representing the data, return whether it is a valid utf-8 encoding.
 Note: The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.
 >>> valid_unicode([197, 130, 1])
 True
 >>> valid_unicode([235, 140, 4])
 False
 """
 successive_10 = 0
 for b in blocks:
 b = bin(b).replace('0b','').rjust(8, '0')
 if successive_10 != 0:
 successive_10 -= 1
 if not b.startswith('10'):
 return False
 elif b[0] == '1':
 successive_10 = len(b.split('0')[0]) - 1
 return True

Just 10 lines of logic instead of 31, and surprisingly (probably because your function contains some redundant checks) my version is also a little bit faster:

# valid_unicode2 is your function
print( timeit.timeit(lambda: valid_unicode2([197, 130, 1])))
print( timeit.timeit(lambda: valid_unicode([197, 130, 1])) )

Gives:

6.9072500330003095
6.539016634000291

So use the right tool for the right job, Python is not C and what is fastest is not what you might expect (so you should not write in Python like you would write in C), and it is better to use it as a prototype language or where performance is not critical.

Question 11

I'm not a fan of stringifying the number to do the bit analysis.

Question 12

@200_success in a production system this would be done in bitwise in C, this is just a prototype to better understand the problem and develop the efficient code

Question 13

@Downvoter: please do read the comment on how this is only a prototype, not the most efficient way

Peilonrayz ♦Peilonrayz 44.4k7 gold badges80 silver badges157 bronze badges · Accepted Answer · 2017-04-04 17:36:15Z

This is much more complicated then if you were to use an iterator of bits. Changing a number to a list of bits, is really easy and simple, creating an iterator is also fairly simple too. This can be done by:

def to_bits(bytes):
 for byte in bytes:
 num = []
 exp = 1 << NUMBER_OF_BITS_PER_BLOCK
 while exp:
 exp >>= 1
 num.append(bool(byte & exp))
 yield num

After this you just need to do the checks in the question, the way the question is laid out. First you check if it's a correct single byte, that you done. After this you can find the amount that you need to loop through, do the checks on this number, and go on to loop through minus one items, and check if they're ok.

This can fairly easily be changed to use bitwise operators, rather than slice operators, but I find them harder to understand.

class Solution(object):
 def validUtf8(self, data):
 """
 :type data: List[int]
 :rtype: bool
 """
 bits = to_bits(data)
 for byte in bits:
 # single byte char
 if byte[0] == 0:
 continue
 # multi-byte character
 amount = sum(takewhile(bool, byte))
 if amount <= 1:
 return False
 if amount >= MAX_NUMBER_OF_ONES:
 return False
 for _ in range(amount - 1):
 try:
 byte = next(bits)
 except StopIteration:
 return False
 if byte[0:2] != [1, 0]:
 return False
 return True

Such a powerful idea to generate lists of bits! And, the takewhile is very appropriate. Thanks so much, learned a lot of new things.

Stack Exchange Network

UTF-8 Validation

3 Answers 3

Bugs

Elegance

Suggested solution

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

UTF-8 Validation

3 Answers 3

Bugs

Elegance

Suggested solution

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions