Transcoding UTF-8 to UTF-16-LE in VBA

Question 1

VBA is a language that's lacking a lot of basic functionality. (Pun intended)
Most libraries, if they exist in the first place, are OS-specific, and even some of the inbuilt functions don't work on MacOS or have bugs in any environment.

Recently I've found that the StrConv(s, vbUnicode) function transcoding a string to the VBA-internal UTF-16-LE encoding doesn't properly work in all environments and for all Unicode codepoints. Therefore, I simply implemented my own function for this task. (SO)

Since this is a fairly low-level task, I was trying to be efficient and to make the code as performant as possible by working on a single byte-array buffer and avoiding things like string concatenation which may be tempting here.

I ended up with this code which pretty much fits my needs. I'm sure though, there are still opportunities for improvement, and I have been asked to post it here for review so here you go.

Function DecodeUTF8(ByVal utf8str As String) As String
 Dim utf8() As Byte, utf16() As Byte, i As Long, j As Long
 Dim codePoint As Long, cpIndex As Long, numBytesOfCodePoint As Long
 utf8 = utf8str
 ReDim utf16(0 To (UBound(utf8) - LBound(utf8) + 1) * 2)
 i = LBound(utf8): cpIndex = 0
 Do While i <= UBound(utf8)
 'Determine the number of bytes in the current UTF-8 codepoint
 numBytesOfCodePoint = 1
 If utf8(i) And &H80 Then
 If utf8(i) And &H20 Then
 If utf8(i) And &H10 Then
 numBytesOfCodePoint = 4
 Else: numBytesOfCodePoint = 3: End If
 Else: numBytesOfCodePoint = 2: End If
 End If
 
 'Check if the UTF-8 codepoint is incomplete at the end of the string
 If i + numBytesOfCodePoint - 1 > UBound(utf8) Then Err.Raise 5, _
 "DecodeUtf8", _
 "Incomplete UTF-8 codepoint at the end of the input string."
 
 'Calculate the Unicode codepoint value from the UTF-8 bytes
 If numBytesOfCodePoint = 1 Then
 codePoint = utf8(i)
 Else
 codePoint = utf8(i) And (2 ^ (7 - numBytesOfCodePoint) - 1)
 For j = 1 To numBytesOfCodePoint - 1
 codePoint = (codePoint * 64) + (utf8(i + j) And &H3F)
 Next j
 End If
 
 'Convert the Unicode codepoint to UTF-16LE bytes
 If codePoint < &H10000 Then
 utf16(cpIndex) = CByte(codePoint And &HFF&)
 utf16(cpIndex + 1) = CByte(codePoint \ &H100&)
 cpIndex = cpIndex + 2
 Else 'Calculate surrogate pair
 Dim m As Long, lowSurrogate As Long, highSurrogate As Long
 m = codePoint - &H10000
 '(m \ &H400&) = most significant 10 bits of m
 highSurrogate = &HD800& + (m \ &H400&)
 '(m And &H3FF) = least significant 10 bits of m
 lowSurrogate = &HDC00& + (m And &H3FF)
 'Concatenate highSurrogate and lowSurrogate as UTF-16LE bytes
 utf16(cpIndex) = CByte(highSurrogate And &HFF&)
 utf16(cpIndex + 1) = CByte(highSurrogate \ &H100&)
 utf16(cpIndex + 2) = CByte(lowSurrogate And &HFF&)
 utf16(cpIndex + 3) = CByte(lowSurrogate \ &H100&)
 cpIndex = cpIndex + 4
 End If
 i = i + numBytesOfCodePoint 'Move to the next UTF-8 codepoint
 Loop
 ReDim Preserve utf16(cpIndex - 1)
 DecodeUTF8 = utf16
End Function

Question 2

Great work, as usual, definitely needed for Mac compatibility and nice as a learning exercise.

Buffer

I don't think there is a way to improve on the Byte buffer approach. After all, the maximum number of bytes would be twice as many if all the codepoints are less than 128 i.e. one byte per codepoint.

An idea would be to replace these 2 lines:

ReDim Preserve utf16(cpIndex - 1)
DecodeUTF8 = utf16

with:

DecodeUTF8 = MidB$(utf16, 1, cpIndex)

but with no performance gain as there is already a conversion going on when passing the array argument as string. We're just saving a line of code.

However, a good improvement would be to have the buffer as an Integer array instead of Byte because then all codepoints and surrogates would be simply assigned to the array with no need to mask with codePoint And &HFF& or bit shift with codePoint \ &H100&. However, this will require some CopyMemory API calls. No time to do that now but I might return with a later edit.

Performance

Raising to power as in (2 ^ (7 - numBytesOfCodePoint) - 1) is a costly operation. We could use a static array of pre-calculated powers of 2 which can be used as a lookup.

We could also use a static array to store the number of bytes needed per codepoint and remove the need to have many If statements.

Somethig like:

Static byteCount(0 To 255) As Byte
Dim i As Long
If byteCount(0) = 0 Then
 For i = &H0 To &H7F '0xxxxxxx
 byteCount(i) = 1
 Next i
 For i = &HC0 To &HDF '110xxxxx
 byteCount(i) = 2
 Next i
 For i = &HE0 To &HEF '1110xxxx
 byteCount(i) = 3
 Next i
 For i = &HF0 To &HF7 '11110xxx
 byteCount(i) = 4
 Next i
End If

and now every first byte of each codepoint can be looked up in the array with something like:
bytesPerCode = byteCount(firstCPByte). A result of 0 would raise an invalid byte error.

Invalid sequence

You already have error handling in place in case the sequence is incomplete. However there are other issues that need to be addressed. This Wikipedia article lists a few:

invalid bytes

an unexpected continuation byte

a non-continuation byte before the end of the character

the string ending before the end of the character (which can happen in simple string truncation)

an overlong encoding

a sequence that decodes to an invalid code point

For example a byte value of 128 (&H80) should be flagged as invalid but in the current approach this passes as valid on the Else: numBytesOfCodePoint = 2: End If branch.

Another example: a byte value of 11000010 (&HC2) denotes a 2-byte codepoint and the second byte should be of the form 10xxxxxx but in the current approach something like 11000000 passes as a valid continuation byte.

The static byteCount array presented in the Performance section already covers nicely for invalid first bytes.

For 2 bytes per codepoint we can avoid overlong encoding by marking C0 and C1 as invalid directly in the static array i.e replace For i = &HC0 To &HDF with For i = &HC2 To &HDF because we know that 110xxxxx 10xxxxxx must already exclude anything below value 128.

For 3 and 4 bytes we can check for overlong encoding after codepoint has been calculated with something like:

If bytesPerCode = 3 And codePoint < &H800& Then
 Err.Raise 5, methodName, "Overlong encoding"
ElseIf bytesPerCode = 4 And codePoint < &H10000 Then
 Err.Raise 5, methodName, "Overlong encoding"
End If

Or, we could just have another static array with the accepted limits and call it like this:

If codePoint < minCodePoint(bytesPerCode) Then
 Err.Raise 5, methodName, "Overlong encoding"
End If

Of course, we could cover 2 bytes per code point directly from the static array without needing to exclude C0 and C1 from the byteCount static array as mentioned above.

Solution

The solution presented below performs about 35% faster than the original solution presented in the question and this is thanks to the lookups using the static arrays. I've renamed the function name as a personal preference although DecodeUTF8 is perfectly fine.

Public Function UTF8ToUTF16LE(ByRef utf8Text As String) As String
 Const methodName As String = "UTF8ToUTF16LE"
 Dim bytesCount As Long: bytesCount = LenB(utf8Text)
 If bytesCount = 0 Then Exit Function
 '
 Dim utf8() As Byte: utf8 = utf8Text
 Dim utf16() As Byte: ReDim utf16(0 To bytesCount * 2 - 1)
 '
 Static byteCount(0 To 255) As Long
 Static mask(3 To 5) As Long
 Static minCodePoint(2 To 4) As Long
 Dim i As Long
 '
 If byteCount(0) = 0 Then
 For i = &H0& To &H7F& '0xxxxxxx
 byteCount(i) = 1
 Next i
 For i = &HC2& To &HDF& '110xxxxx - C0 and C1 are invalid (overlong encoding)
 byteCount(i) = 2
 Next i
 For i = &HE0& To &HEF& '1110xxxx
 byteCount(i) = 3
 Next i
 For i = &HF0& To &HF3& '11110xxx - 111101xx is not valid according to RFC 3629 to match UTF-16 limits
 byteCount(i) = 4
 Next i
 For i = 3 To 5
 mask(i) = 2 ^ i - 1
 Next i
 minCodePoint(2) = &H80&
 minCodePoint(3) = &H800&
 minCodePoint(4) = &H10000
 End If
 '
 Dim codePoint As Long
 Dim bytesPerCode As Long
 Dim j As Long
 Dim loByte As Byte
 Dim bufferIndex As Long: bufferIndex = 0
 '
 i = 0
 Do
 codePoint = utf8(i)
 bytesPerCode = byteCount(codePoint)
 '
 If bytesPerCode = 0 Then
 Err.Raise 5, methodName, "Invalid byte"
 ElseIf bytesPerCode = 1 Then
 utf16(bufferIndex) = codePoint
 bufferIndex = bufferIndex + 2
 ElseIf i + bytesPerCode > bytesCount Then
 Err.Raise 5, methodName, "Incomplete character"
 Else
 codePoint = codePoint And mask(7 - bytesPerCode)
 For j = 1 To bytesPerCode - 1
 loByte = utf8(i + j)
 If (loByte And &HC0&) <> &H80& Then '10xxxxxx
 Err.Raise 5, methodName, "Invalid continuation byte"
 End If
 codePoint = codePoint * &H40& + (loByte And &H3F&)
 Next j
 If codePoint < minCodePoint(bytesPerCode) Then
 Err.Raise 5, methodName, "Overlong encoding"
 End If
 '
 'Convert from Unicode codepoint to UTF-16LE bytes
 If bytesPerCode < 4 Then
 utf16(bufferIndex) = CByte(codePoint And &HFF&)
 utf16(bufferIndex + 1) = CByte(codePoint \ &H100&)
 bufferIndex = bufferIndex + 2
 Else
 Dim m As Long: m = codePoint - &H10000
 Dim lowSurrogate As Long: lowSurrogate = &HDC00& + (m And &H3FF&)
 Dim highSurrogate As Long: highSurrogate = &HD800& + (m \ &H400&)
 '
 utf16(bufferIndex) = CByte(highSurrogate And &HFF&)
 utf16(bufferIndex + 1) = CByte(highSurrogate \ &H100&)
 utf16(bufferIndex + 2) = CByte(lowSurrogate And &HFF&)
 utf16(bufferIndex + 3) = CByte(lowSurrogate \ &H100&)
 bufferIndex = bufferIndex + 4
 End If
 End If
 i = i + bytesPerCode
 Loop Until i >= bytesCount
 UTF8ToUTF16LE = MidB$(utf16, 1, bufferIndex)
End Function

Question 3

Great points you are making! I knew the operations you mentioned were costly but I didn't think of the brilliant array trick! I was aware that my solution allows invalid encodings, I already knew the article you linked, but I thought it would be too complicated to account for all of that and also I expect only correct UTF-8 inputs for my use-case. I agree that for a general-purpose function, it should definitely be taken into account, and will update my SO answer. I might even add a BOM check. Regarding 16bit buffer: I prefer no API calls at a slight performance cost for better portability.

Cristian Buse Cristian Buse 1,6966 silver badges29 bronze badges · Accepted Answer · 2023-03-21 17:29:44Z

Great work, as usual, definitely needed for Mac compatibility and nice as a learning exercise.

Buffer

I don't think there is a way to improve on the Byte buffer approach. After all, the maximum number of bytes would be twice as many if all the codepoints are less than 128 i.e. one byte per codepoint.

An idea would be to replace these 2 lines:

ReDim Preserve utf16(cpIndex - 1)
DecodeUTF8 = utf16

with:

DecodeUTF8 = MidB$(utf16, 1, cpIndex)

but with no performance gain as there is already a conversion going on when passing the array argument as string. We're just saving a line of code.

However, a good improvement would be to have the buffer as an Integer array instead of Byte because then all codepoints and surrogates would be simply assigned to the array with no need to mask with codePoint And &HFF& or bit shift with codePoint \ &H100&. However, this will require some CopyMemory API calls. No time to do that now but I might return with a later edit.

Performance

Raising to power as in (2 ^ (7 - numBytesOfCodePoint) - 1) is a costly operation. We could use a static array of pre-calculated powers of 2 which can be used as a lookup.

We could also use a static array to store the number of bytes needed per codepoint and remove the need to have many If statements.

Somethig like:

Static byteCount(0 To 255) As Byte
Dim i As Long
If byteCount(0) = 0 Then
 For i = &H0 To &H7F '0xxxxxxx
 byteCount(i) = 1
 Next i
 For i = &HC0 To &HDF '110xxxxx
 byteCount(i) = 2
 Next i
 For i = &HE0 To &HEF '1110xxxx
 byteCount(i) = 3
 Next i
 For i = &HF0 To &HF7 '11110xxx
 byteCount(i) = 4
 Next i
End If

and now every first byte of each codepoint can be looked up in the array with something like:
bytesPerCode = byteCount(firstCPByte). A result of 0 would raise an invalid byte error.

Invalid sequence

You already have error handling in place in case the sequence is incomplete. However there are other issues that need to be addressed. This Wikipedia article lists a few:

invalid bytes

an unexpected continuation byte

a non-continuation byte before the end of the character

the string ending before the end of the character (which can happen in simple string truncation)

an overlong encoding

a sequence that decodes to an invalid code point

For example a byte value of 128 (&H80) should be flagged as invalid but in the current approach this passes as valid on the Else: numBytesOfCodePoint = 2: End If branch.

Another example: a byte value of 11000010 (&HC2) denotes a 2-byte codepoint and the second byte should be of the form 10xxxxxx but in the current approach something like 11000000 passes as a valid continuation byte.

The static byteCount array presented in the Performance section already covers nicely for invalid first bytes.

For 2 bytes per codepoint we can avoid overlong encoding by marking C0 and C1 as invalid directly in the static array i.e replace For i = &HC0 To &HDF with For i = &HC2 To &HDF because we know that 110xxxxx 10xxxxxx must already exclude anything below value 128.

For 3 and 4 bytes we can check for overlong encoding after codepoint has been calculated with something like:

If bytesPerCode = 3 And codePoint < &H800& Then
 Err.Raise 5, methodName, "Overlong encoding"
ElseIf bytesPerCode = 4 And codePoint < &H10000 Then
 Err.Raise 5, methodName, "Overlong encoding"
End If

Or, we could just have another static array with the accepted limits and call it like this:

If codePoint < minCodePoint(bytesPerCode) Then
 Err.Raise 5, methodName, "Overlong encoding"
End If

Of course, we could cover 2 bytes per code point directly from the static array without needing to exclude C0 and C1 from the byteCount static array as mentioned above.

Solution

The solution presented below performs about 35% faster than the original solution presented in the question and this is thanks to the lookups using the static arrays. I've renamed the function name as a personal preference although DecodeUTF8 is perfectly fine.

Public Function UTF8ToUTF16LE(ByRef utf8Text As String) As String
 Const methodName As String = "UTF8ToUTF16LE"
 Dim bytesCount As Long: bytesCount = LenB(utf8Text)
 If bytesCount = 0 Then Exit Function
 '
 Dim utf8() As Byte: utf8 = utf8Text
 Dim utf16() As Byte: ReDim utf16(0 To bytesCount * 2 - 1)
 '
 Static byteCount(0 To 255) As Long
 Static mask(3 To 5) As Long
 Static minCodePoint(2 To 4) As Long
 Dim i As Long
 '
 If byteCount(0) = 0 Then
 For i = &H0& To &H7F& '0xxxxxxx
 byteCount(i) = 1
 Next i
 For i = &HC2& To &HDF& '110xxxxx - C0 and C1 are invalid (overlong encoding)
 byteCount(i) = 2
 Next i
 For i = &HE0& To &HEF& '1110xxxx
 byteCount(i) = 3
 Next i
 For i = &HF0& To &HF3& '11110xxx - 111101xx is not valid according to RFC 3629 to match UTF-16 limits
 byteCount(i) = 4
 Next i
 For i = 3 To 5
 mask(i) = 2 ^ i - 1
 Next i
 minCodePoint(2) = &H80&
 minCodePoint(3) = &H800&
 minCodePoint(4) = &H10000
 End If
 '
 Dim codePoint As Long
 Dim bytesPerCode As Long
 Dim j As Long
 Dim loByte As Byte
 Dim bufferIndex As Long: bufferIndex = 0
 '
 i = 0
 Do
 codePoint = utf8(i)
 bytesPerCode = byteCount(codePoint)
 '
 If bytesPerCode = 0 Then
 Err.Raise 5, methodName, "Invalid byte"
 ElseIf bytesPerCode = 1 Then
 utf16(bufferIndex) = codePoint
 bufferIndex = bufferIndex + 2
 ElseIf i + bytesPerCode > bytesCount Then
 Err.Raise 5, methodName, "Incomplete character"
 Else
 codePoint = codePoint And mask(7 - bytesPerCode)
 For j = 1 To bytesPerCode - 1
 loByte = utf8(i + j)
 If (loByte And &HC0&) <> &H80& Then '10xxxxxx
 Err.Raise 5, methodName, "Invalid continuation byte"
 End If
 codePoint = codePoint * &H40& + (loByte And &H3F&)
 Next j
 If codePoint < minCodePoint(bytesPerCode) Then
 Err.Raise 5, methodName, "Overlong encoding"
 End If
 '
 'Convert from Unicode codepoint to UTF-16LE bytes
 If bytesPerCode < 4 Then
 utf16(bufferIndex) = CByte(codePoint And &HFF&)
 utf16(bufferIndex + 1) = CByte(codePoint \ &H100&)
 bufferIndex = bufferIndex + 2
 Else
 Dim m As Long: m = codePoint - &H10000
 Dim lowSurrogate As Long: lowSurrogate = &HDC00& + (m And &H3FF&)
 Dim highSurrogate As Long: highSurrogate = &HD800& + (m \ &H400&)
 '
 utf16(bufferIndex) = CByte(highSurrogate And &HFF&)
 utf16(bufferIndex + 1) = CByte(highSurrogate \ &H100&)
 utf16(bufferIndex + 2) = CByte(lowSurrogate And &HFF&)
 utf16(bufferIndex + 3) = CByte(lowSurrogate \ &H100&)
 bufferIndex = bufferIndex + 4
 End If
 End If
 i = i + bytesPerCode
 Loop Until i >= bytesCount
 UTF8ToUTF16LE = MidB$(utf16, 1, bufferIndex)
End Function

Great points you are making! I knew the operations you mentioned were costly but I didn't think of the brilliant array trick! I was aware that my solution allows invalid encodings, I already knew the article you linked, but I thought it would be too complicated to account for all of that and also I expect only correct UTF-8 inputs for my use-case. I agree that for a general-purpose function, it should definitely be taken into account, and will update my SO answer. I might even add a BOM check. Regarding 16bit buffer: I prefer no API calls at a slight performance cost for better portability.

Stack Exchange Network

Transcoding UTF-8 to UTF-16-LE in VBA

1 Answer 1

Buffer

Performance

Invalid sequence

Solution

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Transcoding UTF-8 to UTF-16-LE in VBA

1 Answer 1

Buffer

Performance

Invalid sequence

Solution

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions