VBA is a language that's lacking a lot of basic functionality. (Pun intended)
Most libraries, if they exist in the first place, are OS-specific, and even some of the inbuilt functions don't work on MacOS or have bugs in any environment.
Recently I've found that the StrConv(s, vbUnicode)
function transcoding a string to the VBA-internal UTF-16-LE encoding doesn't properly work in all environments and for all Unicode codepoints. Therefore, I simply implemented my own function for this task. (SO)
Since this is a fairly low-level task, I was trying to be efficient and to make the code as performant as possible by working on a single byte-array buffer and avoiding things like string concatenation which may be tempting here.
I ended up with this code which pretty much fits my needs. I'm sure though, there are still opportunities for improvement, and I have been asked to post it here for review so here you go.
Function DecodeUTF8(ByVal utf8str As String) As String
Dim utf8() As Byte, utf16() As Byte, i As Long, j As Long
Dim codePoint As Long, cpIndex As Long, numBytesOfCodePoint As Long
utf8 = utf8str
ReDim utf16(0 To (UBound(utf8) - LBound(utf8) + 1) * 2)
i = LBound(utf8): cpIndex = 0
Do While i <= UBound(utf8)
'Determine the number of bytes in the current UTF-8 codepoint
numBytesOfCodePoint = 1
If utf8(i) And &H80 Then
If utf8(i) And &H20 Then
If utf8(i) And &H10 Then
numBytesOfCodePoint = 4
Else: numBytesOfCodePoint = 3: End If
Else: numBytesOfCodePoint = 2: End If
End If
'Check if the UTF-8 codepoint is incomplete at the end of the string
If i + numBytesOfCodePoint - 1 > UBound(utf8) Then Err.Raise 5, _
"DecodeUtf8", _
"Incomplete UTF-8 codepoint at the end of the input string."
'Calculate the Unicode codepoint value from the UTF-8 bytes
If numBytesOfCodePoint = 1 Then
codePoint = utf8(i)
Else
codePoint = utf8(i) And (2 ^ (7 - numBytesOfCodePoint) - 1)
For j = 1 To numBytesOfCodePoint - 1
codePoint = (codePoint * 64) + (utf8(i + j) And &H3F)
Next j
End If
'Convert the Unicode codepoint to UTF-16LE bytes
If codePoint < &H10000 Then
utf16(cpIndex) = CByte(codePoint And &HFF&)
utf16(cpIndex + 1) = CByte(codePoint \ &H100&)
cpIndex = cpIndex + 2
Else 'Calculate surrogate pair
Dim m As Long, lowSurrogate As Long, highSurrogate As Long
m = codePoint - &H10000
'(m \ &H400&) = most significant 10 bits of m
highSurrogate = &HD800& + (m \ &H400&)
'(m And &H3FF) = least significant 10 bits of m
lowSurrogate = &HDC00& + (m And &H3FF)
'Concatenate highSurrogate and lowSurrogate as UTF-16LE bytes
utf16(cpIndex) = CByte(highSurrogate And &HFF&)
utf16(cpIndex + 1) = CByte(highSurrogate \ &H100&)
utf16(cpIndex + 2) = CByte(lowSurrogate And &HFF&)
utf16(cpIndex + 3) = CByte(lowSurrogate \ &H100&)
cpIndex = cpIndex + 4
End If
i = i + numBytesOfCodePoint 'Move to the next UTF-8 codepoint
Loop
ReDim Preserve utf16(cpIndex - 1)
DecodeUTF8 = utf16
End Function
1 Answer 1
Great work, as usual, definitely needed for Mac compatibility and nice as a learning exercise.
Buffer
I don't think there is a way to improve on the Byte buffer approach. After all, the maximum number of bytes would be twice as many if all the codepoints are less than 128 i.e. one byte per codepoint.
An idea would be to replace these 2 lines:
ReDim Preserve utf16(cpIndex - 1)
DecodeUTF8 = utf16
with:
DecodeUTF8 = MidB$(utf16, 1, cpIndex)
but with no performance gain as there is already a conversion going on when passing the array argument as string. We're just saving a line of code.
However, a good improvement would be to have the buffer as an Integer
array instead of Byte
because then all codepoints and surrogates would be simply assigned to the array with no need to mask with codePoint And &HFF&
or bit shift with codePoint \ &H100&
. However, this will require some CopyMemory
API calls. No time to do that now but I might return with a later edit.
Performance
Raising to power as in (2 ^ (7 - numBytesOfCodePoint) - 1)
is a costly operation. We could use a static array of pre-calculated powers of 2 which can be used as a lookup.
We could also use a static array to store the number of bytes needed per codepoint and remove the need to have many If
statements.
Somethig like:
Static byteCount(0 To 255) As Byte
Dim i As Long
If byteCount(0) = 0 Then
For i = &H0 To &H7F '0xxxxxxx
byteCount(i) = 1
Next i
For i = &HC0 To &HDF '110xxxxx
byteCount(i) = 2
Next i
For i = &HE0 To &HEF '1110xxxx
byteCount(i) = 3
Next i
For i = &HF0 To &HF7 '11110xxx
byteCount(i) = 4
Next i
End If
and now every first byte of each codepoint can be looked up in the array with something like:
bytesPerCode = byteCount(firstCPByte)
. A result of 0 would raise an invalid byte error.
Invalid sequence
You already have error handling in place in case the sequence is incomplete. However there are other issues that need to be addressed. This Wikipedia article lists a few:
- invalid bytes
- an unexpected continuation byte
- a non-continuation byte before the end of the character
- the string ending before the end of the character (which can happen in simple string truncation)
- an overlong encoding
- a sequence that decodes to an invalid code point
For example a byte value of 128 (&H80) should be flagged as invalid but in the current approach this passes as valid on the Else: numBytesOfCodePoint = 2: End If
branch.
Another example: a byte value of 11000010 (&HC2) denotes a 2-byte codepoint and the second byte should be of the form 10xxxxxx but in the current approach something like 11000000 passes as a valid continuation byte.
The static byteCount
array presented in the Performance section already covers nicely for invalid first bytes.
For 2 bytes per codepoint we can avoid overlong encoding by marking C0 and C1 as invalid directly in the static array i.e replace For i = &HC0 To &HDF
with For i = &HC2 To &HDF
because we know that 110xxxxx 10xxxxxx must already exclude anything below value 128.
For 3 and 4 bytes we can check for overlong encoding after codepoint has been calculated with something like:
If bytesPerCode = 3 And codePoint < &H800& Then
Err.Raise 5, methodName, "Overlong encoding"
ElseIf bytesPerCode = 4 And codePoint < &H10000 Then
Err.Raise 5, methodName, "Overlong encoding"
End If
Or, we could just have another static array with the accepted limits and call it like this:
If codePoint < minCodePoint(bytesPerCode) Then
Err.Raise 5, methodName, "Overlong encoding"
End If
Of course, we could cover 2 bytes per code point directly from the static array without needing to exclude C0 and C1 from the byteCount
static array as mentioned above.
Solution
The solution presented below performs about 35% faster than the original solution presented in the question and this is thanks to the lookups using the static arrays. I've renamed the function name as a personal preference although DecodeUTF8
is perfectly fine.
Public Function UTF8ToUTF16LE(ByRef utf8Text As String) As String
Const methodName As String = "UTF8ToUTF16LE"
Dim bytesCount As Long: bytesCount = LenB(utf8Text)
If bytesCount = 0 Then Exit Function
'
Dim utf8() As Byte: utf8 = utf8Text
Dim utf16() As Byte: ReDim utf16(0 To bytesCount * 2 - 1)
'
Static byteCount(0 To 255) As Long
Static mask(3 To 5) As Long
Static minCodePoint(2 To 4) As Long
Dim i As Long
'
If byteCount(0) = 0 Then
For i = &H0& To &H7F& '0xxxxxxx
byteCount(i) = 1
Next i
For i = &HC2& To &HDF& '110xxxxx - C0 and C1 are invalid (overlong encoding)
byteCount(i) = 2
Next i
For i = &HE0& To &HEF& '1110xxxx
byteCount(i) = 3
Next i
For i = &HF0& To &HF3& '11110xxx - 111101xx is not valid according to RFC 3629 to match UTF-16 limits
byteCount(i) = 4
Next i
For i = 3 To 5
mask(i) = 2 ^ i - 1
Next i
minCodePoint(2) = &H80&
minCodePoint(3) = &H800&
minCodePoint(4) = &H10000
End If
'
Dim codePoint As Long
Dim bytesPerCode As Long
Dim j As Long
Dim loByte As Byte
Dim bufferIndex As Long: bufferIndex = 0
'
i = 0
Do
codePoint = utf8(i)
bytesPerCode = byteCount(codePoint)
'
If bytesPerCode = 0 Then
Err.Raise 5, methodName, "Invalid byte"
ElseIf bytesPerCode = 1 Then
utf16(bufferIndex) = codePoint
bufferIndex = bufferIndex + 2
ElseIf i + bytesPerCode > bytesCount Then
Err.Raise 5, methodName, "Incomplete character"
Else
codePoint = codePoint And mask(7 - bytesPerCode)
For j = 1 To bytesPerCode - 1
loByte = utf8(i + j)
If (loByte And &HC0&) <> &H80& Then '10xxxxxx
Err.Raise 5, methodName, "Invalid continuation byte"
End If
codePoint = codePoint * &H40& + (loByte And &H3F&)
Next j
If codePoint < minCodePoint(bytesPerCode) Then
Err.Raise 5, methodName, "Overlong encoding"
End If
'
'Convert from Unicode codepoint to UTF-16LE bytes
If bytesPerCode < 4 Then
utf16(bufferIndex) = CByte(codePoint And &HFF&)
utf16(bufferIndex + 1) = CByte(codePoint \ &H100&)
bufferIndex = bufferIndex + 2
Else
Dim m As Long: m = codePoint - &H10000
Dim lowSurrogate As Long: lowSurrogate = &HDC00& + (m And &H3FF&)
Dim highSurrogate As Long: highSurrogate = &HD800& + (m \ &H400&)
'
utf16(bufferIndex) = CByte(highSurrogate And &HFF&)
utf16(bufferIndex + 1) = CByte(highSurrogate \ &H100&)
utf16(bufferIndex + 2) = CByte(lowSurrogate And &HFF&)
utf16(bufferIndex + 3) = CByte(lowSurrogate \ &H100&)
bufferIndex = bufferIndex + 4
End If
End If
i = i + bytesPerCode
Loop Until i >= bytesCount
UTF8ToUTF16LE = MidB$(utf16, 1, bufferIndex)
End Function
-
1\$\begingroup\$ Great points you are making! I knew the operations you mentioned were costly but I didn't think of the brilliant array trick! I was aware that my solution allows invalid encodings, I already knew the article you linked, but I thought it would be too complicated to account for all of that and also I expect only correct UTF-8 inputs for my use-case. I agree that for a general-purpose function, it should definitely be taken into account, and will update my SO answer. I might even add a BOM check. Regarding 16bit buffer: I prefer no API calls at a slight performance cost for better portability. \$\endgroup\$GWD– GWD2023年03月21日 17:53:16 +00:00Commented Mar 21, 2023 at 17:53