UTF-8 (OMUTF8)

This library is used to process data files that contain UTF-8 encoded data.

The function utf8.char matches any UTF-8 character in the data. The function utf8.single-byte-char matches only single-byte (ASCII) UTF-8 characters, whereas the function utf8.multi-byte-char matches only multi-byte UTF-8 characters, that is, characters encoded as a sequence of two or more bytes.
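The distinction between single-byte and multi-byte characters follows from the lead byte of each UTF-8 sequence. The following is a minimal Python sketch of that idea, not OmniMark code and not how the library is implemented; the names sequence_length and split_chars are hypothetical, and the input is assumed to be well-formed UTF-8.

```python
def sequence_length(lead: int) -> int:
    """Return the length in bytes of a UTF-8 sequence, given its lead byte."""
    if lead < 0x80:
        return 1          # 0xxxxxxx: single-byte (ASCII) character
    if 0xC0 <= lead < 0xE0:
        return 2          # 110xxxxx: two-byte sequence
    if 0xE0 <= lead < 0xF0:
        return 3          # 1110xxxx: three-byte sequence
    if 0xF0 <= lead < 0xF8:
        return 4          # 11110xxx: four-byte sequence
    raise ValueError(f"not a UTF-8 lead byte: {lead:#04x}")


def split_chars(data: bytes) -> list:
    """Split well-formed UTF-8 data into one bytes object per character."""
    chars = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chars.append(data[i:i + n])
        i += n
    return chars


chars = split_chars(b"flamb\xc3\xa9")
single = [c for c in chars if len(c) == 1]   # ASCII characters
multi = [c for c in chars if len(c) > 1]     # multi-byte characters
print(single)
print(multi)
```

In this sketch, every character that split_chars returns corresponds to what utf8.char would match, and the length of each item decides whether it falls in the single-byte or the multi-byte group.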

The function utf8.code-point converts a UTF-8 character (that is, the sequence of bytes that represents a character in UTF-8) to its binary character value, or code point. The function utf8.encoding performs the reverse conversion: it converts a binary character value to the sequence of bytes that represents that value in UTF-8.
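The two conversions can be illustrated with the bit arithmetic that UTF-8 defines. This is a minimal Python sketch, not OmniMark code; the Python functions code_point and encoding are hypothetical stand-ins named after the library functions, and both assume valid input (a single well-formed sequence, or a code point no greater than 0x10FFFF).

```python
def code_point(seq: bytes) -> int:
    """Decode one well-formed UTF-8 sequence to its integer code point."""
    lead = seq[0]
    if lead < 0x80:
        return lead                      # single-byte character: value is the byte
    n = len(seq)
    # The lead byte carries fewer payload bits as the sequence gets longer:
    # 2 bytes -> mask 0x1F, 3 bytes -> mask 0x0F, 4 bytes -> mask 0x07.
    value = lead & (0x7F >> n)
    for b in seq[1:]:
        value = (value << 6) | (b & 0x3F)  # each continuation byte adds 6 bits
    return value


def encoding(cp: int) -> bytes:
    """Encode an integer code point as its UTF-8 byte sequence."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(code_point(b"\xc3\xa9"))   # the two bytes 195, 169
print(encoding(233))
```

The two functions are inverses of each other for valid input, so encoding(code_point(seq)) returns the original bytes.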

Example

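The following program shows how utf8.single-byte-char and utf8.multi-byte-char can be used to pattern-match UTF-8 encoded data, and how utf8.code-point can be used to convert the captured bytes to their binary value. Single-byte characters are copied to the output unchanged. For each multi-byte character, a code point above 255 is written as a hexadecimal numeric character reference, and any other code point is written in binary form.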

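 import "omutf8.xmd" prefixed by utf8.
 
 process
    ; "%195#%169#" is the two-byte UTF-8 encoding of code point 233 (e-acute)
    repeat scan "flamb%195#%169#"
    match utf8.single-byte-char+ => c
       ; runs of ASCII characters are copied through unchanged
       output c
 
    match utf8.multi-byte-char => c
       local integer n initial { utf8.code-point of c }
 
       do when n > 255
          ; write the code point as a hexadecimal character reference
          output "&#x" || "16rud" % n || ";"
 
       else
          ; write the code point in binary form
          output "b" % n
       done
    again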
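
For readers who want to check the transformation outside OmniMark, the following Python sketch mirrors the logic of the program above. It is an illustration only: the function name transcode is hypothetical, and the sketch assumes that the "b" format writes the value as a single byte.

```python
def transcode(data: bytes) -> bytes:
    """Rewrite UTF-8 input: code points above 255 become hexadecimal
    character references, all other code points become a single byte."""
    out = bytearray()
    for ch in data.decode("utf-8"):
        n = ord(ch)                      # the character's code point
        if n > 255:
            out += f"&#x{n:X};".encode("ascii")
        else:
            out.append(n)                # assumed equivalent of "b" % n
    return bytes(out)


print(transcode(b"flamb\xc3\xa9"))   # prints b'flamb\xe9'
```

Under that assumption, the sample input produces the five ASCII bytes of "flamb" followed by the single byte 233.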

Functions