I've recently solved the "UTF-8 Validation" LeetCode problem:
A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:
For 1-byte character, the first bit is a 0, followed by its unicode code.
For n-bytes character, the first n-bits are all one's, the n+1 bit is 0, followed by n-1 bytes with most significant 2 bits being 10.
This is how the UTF-8 encoding would work:
Char. number range | UTF-8 octet sequence (hexadecimal) | (binary) --------------------+--------------------------------------------- 0000 0000-0000 007F | 0xxxxxxx 0000 0080-0000 07FF | 110xxxxx 10xxxxxx 0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx 0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Given an array of integers representing the data, return whether it is a valid utf-8 encoding.
Note: The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.
Example 1:
data = [197, 130, 1]
, which represents the octet sequence:11000101 10000010 00000001
.Return true. It is a valid utf-8 encoding for a 2-bytes character followed by a 1-byte character.
Example 2:
data = [235, 140, 4]
, which represented the octet sequence:11101011 10001100 00000100
.Return false. The first 3 bits are all one's and the 4th bit is 0 means it is a 3-bytes character. The next byte is a continuation byte which starts with 10 and that's correct. But the second continuation byte does not start with 10, so it is invalid.
The code works and was accepted by the OJ:
NUMBER_OF_BITS_PER_BLOCK = 8
MAX_NUMBER_OF_ONES = 4
class Solution(object):
def validUtf8(self, data):
"""
:type data: List[int]
:rtype: bool
"""
index = 0
while index < len(data):
number = data[index] & (2 ** 7)
number >>= (NUMBER_OF_BITS_PER_BLOCK - 1)
if number == 0: # single byte char
index += 1
continue
# validate multi-byte char
number_of_ones = 0
while True: # get the number of significant ones
number = data[index] & (2 ** (7 - number_of_ones))
number >>= (NUMBER_OF_BITS_PER_BLOCK - number_of_ones - 1)
if number == 1:
number_of_ones += 1
else:
break
if number_of_ones > MAX_NUMBER_OF_ONES:
return False # too much ones per char sequence
if number_of_ones == 1:
return False # there has to be at least 2 ones
index += 1 # move on to check the next byte in a multi-byte char sequence
# check for out of bounds and exit early
if index >= len(data) or index >= (index + number_of_ones - 1):
return False
# every next byte has to start with "10"
for i in range(index, index + number_of_ones - 1):
number = data[i]
number >>= (NUMBER_OF_BITS_PER_BLOCK - 1)
if number != 1:
return False
number >>= (NUMBER_OF_BITS_PER_BLOCK - 1)
if number != 0:
return False
index += 1
return True
I've always being struggling to remember the bit manipulation tricks and tried to solve this problem without looking them up - hence, I think the code is overloaded with left and right shifts and power of two multiplications.
Am I overcomplicating the problem? What would you improve in the proposed solution? Is there a better way?
-
3\$\begingroup\$ Once you are happy with your solution, add the other requirements: 1. Every code point must be represented in the shortest possible way, for example code points ≤ 127 must not use two bytes. 2. Code points 0xd800 to 0xdfff are invalid. 3. Code points above 0x10ffff are invalid. \$\endgroup\$gnasher729– gnasher7292017年04月04日 19:58:29 +00:00Commented Apr 4, 2017 at 19:58
-
1\$\begingroup\$ I think you want to avoid the world overload because it usually means something else in the context of writing program code \$\endgroup\$Nayuki– Nayuki2017年04月05日 03:21:18 +00:00Commented Apr 5, 2017 at 3:21
-
\$\begingroup\$ It's good idea to avoid the world overload \$\endgroup\$NinjaG– NinjaG2017年12月19日 17:57:12 +00:00Commented Dec 19, 2017 at 17:57
3 Answers 3
This is much more complicated then if you were to use an iterator of bits. Changing a number to a list of bits, is really easy and simple, creating an iterator is also fairly simple too. This can be done by:
def to_bits(bytes):
for byte in bytes:
num = []
exp = 1 << NUMBER_OF_BITS_PER_BLOCK
while exp:
exp >>= 1
num.append(bool(byte & exp))
yield num
After this you just need to do the checks in the question, the way the question is laid out. First you check if it's a correct single byte, that you done. After this you can find the amount that you need to loop through, do the checks on this number, and go on to loop through minus one items, and check if they're ok.
This can fairly easily be changed to use bitwise operators, rather than slice operators, but I find them harder to understand.
class Solution(object):
def validUtf8(self, data):
"""
:type data: List[int]
:rtype: bool
"""
bits = to_bits(data)
for byte in bits:
# single byte char
if byte[0] == 0:
continue
# multi-byte character
amount = sum(takewhile(bool, byte))
if amount <= 1:
return False
if amount >= MAX_NUMBER_OF_ONES:
return False
for _ in range(amount - 1):
try:
byte = next(bits)
except StopIteration:
return False
if byte[0:2] != [1, 0]:
return False
return True
-
\$\begingroup\$ Such a powerful idea to generate lists of bits! And, the
takewhile
is very appropriate. Thanks so much, learned a lot of new things. \$\endgroup\$alecxe– alecxe2017年04月04日 17:41:25 +00:00Commented Apr 4, 2017 at 17:41
Bugs
I'm surprised that the code was accepted by the online judge. This part of the code makes no sense to me:
# check for out of bounds and exit early if index >= len(data) or index >= (index + number_of_ones - 1): return False # every next byte has to start with "10" for i in range(index, index + number_of_ones - 1): number = data[i] number >>= (NUMBER_OF_BITS_PER_BLOCK - 1) if number != 1: return False number >>= (NUMBER_OF_BITS_PER_BLOCK - 1) if number != 0: return False index += 1
What does index >= (index + number_of_ones - 1)
mean? It's the same as number_of_ones <= 1
. That is never going happen, since you have already weeded out the # single byte char
and # there has to be at least 2 ones
cases earlier.
So, you are effectively checking only if index >= len(data): return False
. That verifies that the data
are long enough to contain one trailing byte, but it does not ensure that it is long enough to contain all of the expected trailing bytes. That is, number = data[i]
could crash if data
is a prematurely truncated sequence.
I'm displeased with the way you reassign number
with number >>= ...
here and in the # get the number of significant ones
loop. Such mutation makes your code fragile and hard to understand.
This loop does not actually validate that the trailing bytes start with "10"
, as the comment claims. You do verify that the most-significant bit is 1, with if number != 1: return False
. That's all. If you right-shift number
(which is an integer ≤ 255) by 14 bits, then of course you will always get 0
!
Elegance
Looping by keeping incrementing an index
is awkward in Python: you have index = 0
, while index < len(data)
, and several index += 1
. Python does offer more expressive iteration techniques, and here I would recommend using an iterator.
The code would be more readable if the # get the number of significant ones
loop were moved into a helper function.
I don't think that you need a special case for # single byte char
. Just treat it as if you expect zero trailing bytes.
I would rename number
to byte
, because that's what it represents.
A docstring with a doctest would be appropriate for this function.
Suggested solution
class Solution(object):
def validUtf8(self, data):
"""
Check that a sequence of byte values follows the UTF-8 encoding
rules. Does not check for canonicalization (i.e. overlong encodings
are acceptable).
>>> s = Solution()
>>> s.validUtf8([197, 130, 1])
True
>>> s.validUtf8([235, 140, 4])
False
"""
data = iter(data)
for leading_byte in data:
leading_ones = self._count_leading_ones(leading_byte)
if leading_ones in [1, 7, 8]:
return False # Illegal leading byte
for _ in range(leading_ones - 1):
trailing_byte = next(data, None)
if trailing_byte is None or trailing_byte >> 6 != 0b10:
return False # Missing or illegal trailing byte
return True
@staticmethod
def _count_leading_ones(byte):
for i in range(8):
if byte >> (7 - i) == 0b11111111 >> (7 - i) & ~1:
return i
return 8
-
\$\begingroup\$ Absolutely valid points, omg, I don't know what I was thinking when writing this awkward if statement. And, yes, I also hated to redefine the
number
. Thank you for the very concise and Pythonic solution, definitely learned a lot from all 3 answers today. \$\endgroup\$alecxe– alecxe2017年04月04日 22:41:29 +00:00Commented Apr 4, 2017 at 22:41 -
4\$\begingroup\$ Consider
if leading_ones in (1, 7, 8):
to avoid allocating a new list for each byte in the data. Also, consider decorating_count_leading_ones
with@functools.lru_cache(maxsize=256)
to avoid repeated work. \$\endgroup\$Gareth Rees– Gareth Rees2017年04月05日 15:11:54 +00:00Commented Apr 5, 2017 at 15:11
I think yes, you are over-complicating.
Bit-wise is used to squeeze every last bit of performance from a machine, but using it an interpreted language like Python kinda of defeats the whole point.
If you want to make this fast you should use C, if you want to use Python you might as well make this idiomatic and not worry about performance, so you can use this script to Oracle test the C script you write.
In short using the built-in tools of string manipulation makes this problem simple:
def valid_unicode(blocks):
"""
A character in UTF8 can be from 1 to 4 bytes long, subjected to the following rules:
For 1-byte character, the first bit is a 0, followed by its unicode code.
For n-bytes character, the first n-bits are all one's, the n+1 bit is 0, followed by n-1 bytes with most significant 2 bits being 10.
Given an array of integers representing the data, return whether it is a valid utf-8 encoding.
Note: The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.
>>> valid_unicode([197, 130, 1])
True
>>> valid_unicode([235, 140, 4])
False
"""
successive_10 = 0
for b in blocks:
b = bin(b).replace('0b','').rjust(8, '0')
if successive_10 != 0:
successive_10 -= 1
if not b.startswith('10'):
return False
elif b[0] == '1':
successive_10 = len(b.split('0')[0]) - 1
return True
Just 10 lines of logic instead of 31, and surprisingly (probably because your function contains some redundant checks) my version is also a little bit faster:
# valid_unicode2 is your function
print( timeit.timeit(lambda: valid_unicode2([197, 130, 1])))
print( timeit.timeit(lambda: valid_unicode([197, 130, 1])) )
Gives:
6.9072500330003095
6.539016634000291
So use the right tool for the right job, Python is not C and what is fastest is not what you might expect (so you should not write in Python like you would write in C), and it is better to use it as a prototype language or where performance is not critical.
-
2\$\begingroup\$ I'm not a fan of stringifying the number to do the bit analysis. \$\endgroup\$200_success– 200_success2017年04月04日 18:11:50 +00:00Commented Apr 4, 2017 at 18:11
-
2\$\begingroup\$ @200_success in a production system this would be done in bitwise in C, this is just a prototype to better understand the problem and develop the efficient code \$\endgroup\$Caridorc– Caridorc2017年04月04日 18:17:02 +00:00Commented Apr 4, 2017 at 18:17
-
\$\begingroup\$ @Downvoter: please do read the comment on how this is only a prototype, not the most efficient way \$\endgroup\$Caridorc– Caridorc2017年04月07日 20:18:19 +00:00Commented Apr 7, 2017 at 20:18