utf_util - Unicode Transformation Format (UTF) Utilities

The UTF utilities are used to convert between 8-, 16-, and 32-bit Unicode Transformation Format strings. The Unicode character space is 21 bits wide. Consequently, the 8-bit encoding (UTF-8) requires 1 to 4 bytes to represent a Unicode character, and the 16-bit encoding (UTF-16) requires one or two 16-bit code units. The 32-bit encoding (UTF-32) represents any Unicode character directly in a single code unit.
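
As a rough illustration of the sizes quoted above, the following sketch computes how many code units a given code point occupies in UTF-8 and UTF-16. The function names ex_utf8_units() and ex_utf16_units() are hypothetical and are not part of the UTF_UTIL package; returning 0 for surrogates and out-of-range values is likewise an assumption of this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of UTF_UTIL): number of bytes needed to
   encode code point cp in UTF-8, or 0 if cp is not a valid scalar value. */
size_t ex_utf8_units(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate range */
    if (cp <= 0x7F)     return 1;
    if (cp <= 0x7FF)    return 2;
    if (cp <= 0xFFFF)   return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;                                    /* beyond 21-bit space */
}

/* Hypothetical helper: number of 16-bit code units needed in UTF-16,
   or 0 if cp is not a valid scalar value. */
size_t ex_utf16_units(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate range */
    if (cp <= 0xFFFF)   return 1;                /* Basic Multilingual Plane */
    if (cp <= 0x10FFFF) return 2;                /* surrogate pair */
    return 0;
}
```

For example, U+20AC (the euro sign) takes three bytes in UTF-8 and one code unit in UTF-16, while U+1F600 takes four bytes in UTF-8 and two code units in UTF-16.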

UTF-16 and UTF-32 encodings arrange bytes in most significant to least significant (big-endian) order or least significant to most significant (little-endian) order. A byte-order marker (BOM) may be inserted at the beginning of an encoded string to indicate the byte order; if there is no BOM, big-endian order is assumed.
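
The byte-order rule described above might be sketched as follows for a UTF-16 byte buffer. This is an illustration only: the type ex_byte_order and the functions ex_utf16_order() and ex_utf16_read_unit() are hypothetical names, not the package's utf16bom() interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type and functions; not the UTF_UTIL API. */
typedef enum { EX_BIG_ENDIAN, EX_LITTLE_ENDIAN } ex_byte_order;

/* Inspect the first two bytes of a UTF-16 buffer.  Stores the number of
   bytes occupied by the BOM (0 or 2) in *bom_len and returns the byte
   order; big-endian is assumed when no BOM is present. */
ex_byte_order ex_utf16_order(const unsigned char *buf, size_t len,
                             size_t *bom_len)
{
    *bom_len = 0;
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF) {  /* U+FEFF, big-endian */
            *bom_len = 2;
            return EX_BIG_ENDIAN;
        }
        if (buf[0] == 0xFF && buf[1] == 0xFE) {  /* U+FEFF, little-endian */
            *bom_len = 2;
            return EX_LITTLE_ENDIAN;
        }
    }
    return EX_BIG_ENDIAN;                        /* no BOM: default order */
}

/* Assemble one 16-bit code unit from two bytes in the given order. */
uint16_t ex_utf16_read_unit(const unsigned char *p, ex_byte_order order)
{
    return (order == EX_BIG_ENDIAN) ? (uint16_t)((p[0] << 8) | p[1])
                                    : (uint16_t)((p[1] << 8) | p[0]);
}
```

A caller would typically skip *bom_len bytes and then read code units with the returned order.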

UTF-8 strings are sequences of single bytes and so need no BOM, but some applications look for a BOM at the beginning of a file in order to detect that the file is UTF-8 rather than plain ASCII or a Latin character set.
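
The UTF-8 form of the BOM is the three-byte sequence 0xEF 0xBB 0xBF (U+FEFF encoded in UTF-8). A minimal check for it might look like the sketch below; ex_has_utf8_bom() is a hypothetical name, not a UTF_UTIL function.

```c
#include <stddef.h>

/* Hypothetical helper (not part of UTF_UTIL): returns 1 if the buffer
   begins with the UTF-8 encoding of U+FEFF, 0 otherwise. */
int ex_has_utf8_bom(const unsigned char *buf, size_t len)
{
    return len >= 3 &&
           buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF;
}
```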

The conversion functions automatically append a null terminator to converted strings, provided there is enough room in the destination buffer: 0x00 for UTF-8, 0x0000 for UTF-16, and 0x00000000 for UTF-32 strings.
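
To illustrate the "terminator only if there is room" behavior, here is a sketch of a UTF-32 to UTF-8 conversion. It is not the package's utf32utf8() function: the name ex_utf32_to_utf8(), its signature, the choice to stop at the first character that does not fit, and the substitution of U+FFFD for invalid values are all assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch (not the UTF_UTIL API).  Converts nsrc UTF-32 code
   units to UTF-8 in dst, which holds dstsize bytes.  Returns the number
   of bytes written, not counting the terminator. */
size_t ex_utf32_to_utf8(const uint32_t *src, size_t nsrc,
                        unsigned char *dst, size_t dstsize)
{
    size_t n = 0, i, need;

    for (i = 0; i < nsrc; i++) {
        uint32_t cp = src[i];
        /* Assumption: invalid values become U+FFFD. */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            cp = 0xFFFD;
        need = (cp <= 0x7F) ? 1 : (cp <= 0x7FF) ? 2 : (cp <= 0xFFFF) ? 3 : 4;
        if (n + need > dstsize)
            break;                  /* assumption: stop when out of room */
        switch (need) {
        case 1:
            dst[n++] = (unsigned char)cp;
            break;
        case 2:
            dst[n++] = (unsigned char)(0xC0 | (cp >> 6));
            dst[n++] = (unsigned char)(0x80 | (cp & 0x3F));
            break;
        case 3:
            dst[n++] = (unsigned char)(0xE0 | (cp >> 12));
            dst[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            dst[n++] = (unsigned char)(0x80 | (cp & 0x3F));
            break;
        default:
            dst[n++] = (unsigned char)(0xF0 | (cp >> 18));
            dst[n++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            dst[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            dst[n++] = (unsigned char)(0x80 | (cp & 0x3F));
            break;
        }
    }
    if (n < dstsize)
        dst[n] = 0x00;              /* terminator only if it fits */
    return n;
}
```

With this design, a caller that needs a guaranteed terminator would allocate one byte more than the worst case of four bytes per source code unit.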

The UTF_UTIL package was written as a small, self-contained set of straightforward functions intended for converting Unicode strings in MP3 ID3v2 tags. As such, portability was of higher concern than efficiency. If you're doing a lot of Unicode string handling, you might wish to research some of the full-featured Unicode libraries.

Public Procedures

utf16bom() - returns a UTF-16 byte-order marker indication.
utf32bom() - returns a UTF-32 byte-order marker indication.
utf16len() - returns the number of code units in a UTF-16 string.
utf32len() - returns the number of code units in a UTF-32 string.
utf8get() - decodes a code point from a UTF-8 string.
utf8put() - encodes a code point to a UTF-8 string.
utf16get() - decodes a code point from a UTF-16 string.
utf16put() - encodes a code point to a UTF-16 string.
utf32get() - decodes a code point from a UTF-32 string.
utf32put() - encodes a code point to a UTF-32 string.
utf8utf16() - converts a UTF-8 string to a UTF-16 string.
utf8utf32() - converts a UTF-8 string to a UTF-32 string.
utf16utf8() - converts a UTF-16 string to a UTF-8 string.
utf32utf8() - converts a UTF-32 string to a UTF-8 string.

Source Files


(See libgpl for the complete source, including support routines and build files.)

Alex Measday