utf_util - Unicode Transformation Format (UTF) Utilities
The UTF utilities are used to convert between 8-, 16-, and 32-bit Unicode Transformation Format strings. Unicode code points range from U+0000 to U+10FFFF, so the character space is 21 bits wide. Consequently, the 8-bit encoding (UTF-8) requires from 1 to 4 bytes to represent a Unicode character. The 16-bit encoding (UTF-16) requires one or two 16-bit code units; two units (a surrogate pair) are needed for characters above U+FFFF. The 32-bit encoding (UTF-32) can represent any Unicode character as is, in a single code unit.
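The size rules above can be sketched in a few lines of C. The two helpers below are hypothetical illustrations, not functions from the UTF_UTIL package; they only show how the value of a code point determines how many UTF-8 bytes or UTF-16 code units it needs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of utf_util): number of bytes needed to
   encode code point cp in UTF-8, or 0 if cp is not encodable (beyond
   U+10FFFF or inside the surrogate range U+D800..U+DFFF). */
size_t sketch_utf8_length(uint32_t cp)
{
    if (cp > 0x10FFFF) return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp < 0x80) return 1;        /* 7 bits: plain ASCII */
    if (cp < 0x800) return 2;       /* up to 11 bits */
    if (cp < 0x10000) return 3;     /* up to 16 bits */
    return 4;                       /* up to 21 bits */
}

/* Hypothetical helper (not part of utf_util): number of 16-bit code units
   needed to encode cp in UTF-16, or 0 if cp is not encodable. */
size_t sketch_utf16_length(uint32_t cp)
{
    if (cp > 0x10FFFF) return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    return (cp < 0x10000) ? 1 : 2;  /* 2 means a surrogate pair */
}
```

For example, U+20AC (the euro sign) needs three UTF-8 bytes but only one UTF-16 code unit, while U+1F600 needs four UTF-8 bytes and two UTF-16 code units.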
UTF-16 and UTF-32 encodings arrange bytes in most significant to least significant (big-endian) order or least significant to most significant (little-endian) order. A byte-order marker (BOM) may be inserted at the beginning of an encoded string to indicate the byte order; if there is no BOM, big-endian order is assumed.
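As a rough illustration of the byte-order rule just described, the following sketch inspects the first two bytes of a UTF-16 byte sequence. The type and function names are assumptions made for this example; they are not the interface of the package's utf16bom() function, whose signature is not shown here.

```c
#include <stddef.h>

/* Hypothetical names for illustration only. */
typedef enum { SKETCH_BIG_ENDIAN, SKETCH_LITTLE_ENDIAN } SketchByteOrder;

/* Determine the byte order of a UTF-16 byte sequence. *bom_size is set to 2
   if a BOM (U+FEFF) was found at the start and to 0 otherwise. Without a
   BOM, big-endian order is assumed. The caller must already know the data
   is UTF-16: the bytes FF FE also begin a little-endian UTF-32 BOM. */
SketchByteOrder sketch_utf16_byte_order(const unsigned char *bytes,
                                        size_t length, size_t *bom_size)
{
    *bom_size = 0;
    if (length >= 2) {
        if (bytes[0] == 0xFE && bytes[1] == 0xFF) {
            *bom_size = 2;
            return SKETCH_BIG_ENDIAN;
        }
        if (bytes[0] == 0xFF && bytes[1] == 0xFE) {
            *bom_size = 2;
            return SKETCH_LITTLE_ENDIAN;
        }
    }
    return SKETCH_BIG_ENDIAN;       /* no BOM: default to big-endian */
}
```

A caller would typically skip bom_size bytes and then assemble each code unit from two bytes according to the returned order.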
UTF-8 strings need no BOM because UTF-8 is a sequence of single bytes with no byte-order ambiguity, but some applications look for a BOM at the beginning of a file to detect that the file is UTF-8 rather than plain ASCII/Latin.
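The UTF-8 form of the BOM is the three-byte sequence EF BB BF (the UTF-8 encoding of U+FEFF). A minimal check might look like the sketch below; the function name is hypothetical and is not part of the package.

```c
#include <stddef.h>

/* Hypothetical helper (not part of utf_util): returns 1 if the buffer
   begins with the UTF-8 encoding of U+FEFF (bytes EF BB BF), else 0. */
int sketch_has_utf8_bom(const unsigned char *bytes, size_t length)
{
    return length >= 3 &&
           bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
}
```

Text without the marker may still be valid UTF-8, so a negative result says nothing about the encoding.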
The conversion functions automatically append a null terminator to converted strings, provided there is enough room in the destination buffer: 0x00 for UTF-8, 0x0000 for UTF-16, and 0x00000000 for UTF-32 strings.
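To make the terminator convention concrete, here is a small sketch for UTF-16 held as an array of 16-bit code units. Both functions are hypothetical stand-ins written for this example; they do not reproduce the package's utf16len() or its conversion routines. The copy routine assumes, as a guess at the behavior described above, that the terminator is simply omitted when the buffer is full.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count the code units that precede the 0x0000
   terminator of a UTF-16 string. */
size_t sketch_utf16_units(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0) n++;
    return n;
}

/* Hypothetical helper: copy at most capacity code units from src to dst
   and append the 0x0000 terminator only if there is room left for it.
   Returns the number of code units copied, not counting the terminator. */
size_t sketch_utf16_copy(uint16_t *dst, size_t capacity, const uint16_t *src)
{
    size_t n = 0;
    while (n < capacity && src[n] != 0) {
        dst[n] = src[n];
        n++;
    }
    if (n < capacity) dst[n] = 0;   /* room left: terminate the result */
    return n;
}
```

When the return value equals capacity, the destination holds no terminator and the caller has to track the length separately.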
The UTF_UTIL package was written as a small, self-contained set of straightforward functions intended for converting Unicode strings in MP3 ID3v2 tags. As such, portability was of higher concern than efficiency. If you're doing a lot of Unicode string handling, you might wish to research some of the full-featured Unicode libraries.
utf16bom() - returns a UTF-16 byte-order marker indication.
utf32bom() - returns a UTF-32 byte-order marker indication.
utf16len() - returns the number of code units in a UTF-16 string.
utf32len() - returns the number of code units in a UTF-32 string.
utf8get() - decodes a code point from a UTF-8 string.
utf8put() - encodes a code point to a UTF-8 string.
utf16get() - decodes a code point from a UTF-16 string.
utf16put() - encodes a code point to a UTF-16 string.
utf32get() - decodes a code point from a UTF-32 string.
utf32put() - encodes a code point to a UTF-32 string.
utf8utf16() - converts a UTF-8 string to a UTF-16 string.
utf8utf32() - converts a UTF-8 string to a UTF-32 string.
utf16utf8() - converts a UTF-16 string to a UTF-8 string.
utf32utf8() - converts a UTF-32 string to a UTF-8 string.
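The general decode-then-encode pattern behind a UTF-8 to UTF-16 conversion can be sketched as follows. This is an independent illustration under stated assumptions, not the package's code: the function names, signatures, and error convention are all invented for the example, and the output is written as host-order 16-bit code units with no BOM.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: decode one code point from s (length bytes available).
   Returns the number of bytes consumed, or 0 on malformed input. */
size_t sketch_utf8_decode(const unsigned char *s, size_t length, uint32_t *cp)
{
    size_t need, i;
    uint32_t value, min;

    if (length == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; value = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; value = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; value = s[0] & 0x07; min = 0x10000; }
    else return 0;                      /* stray continuation or invalid lead */
    if (length < need) return 0;        /* truncated sequence */
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        value = (value << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
    if (value < min || value > 0x10FFFF ||
        (value >= 0xD800 && value <= 0xDFFF)) return 0;
    *cp = value;
    return need;
}

/* Hypothetical: encode cp as 1 or 2 UTF-16 code units; returns the count,
   or 0 if cp is not encodable. */
size_t sketch_utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    if (cp < 0x10000) { out[0] = (uint16_t) cp; return 1; }
    cp -= 0x10000;                                  /* 20 bits remain */
    out[0] = (uint16_t) (0xD800 | (cp >> 10));      /* high surrogate */
    out[1] = (uint16_t) (0xDC00 | (cp & 0x3FF));    /* low surrogate */
    return 2;
}

/* Hypothetical: convert a NUL-terminated UTF-8 string to a 0x0000-terminated
   UTF-16 string of at most capacity code units (terminator included).
   Returns the number of code units written before the terminator, or
   (size_t) -1 on malformed input or insufficient room. */
size_t sketch_utf8_to_utf16(const char *src, uint16_t *dst, size_t capacity)
{
    const unsigned char *s = (const unsigned char *) src;
    size_t remaining = strlen(src), written = 0;

    if (capacity == 0) return (size_t) -1;
    while (remaining > 0) {
        uint32_t cp;
        uint16_t units[2];
        size_t count, i, used = sketch_utf8_decode(s, remaining, &cp);
        if (used == 0) return (size_t) -1;
        count = sketch_utf16_encode(cp, units);
        if (written + count >= capacity) return (size_t) -1;  /* keep room for 0x0000 */
        for (i = 0; i < count; i++) dst[written++] = units[i];
        s += used;
        remaining -= used;
    }
    dst[written] = 0;
    return written;
}
```

Unlike the behavior described for the package, this sketch treats a missing terminator as an error instead of omitting it; that is a design choice of the example. The reverse direction follows the same pattern with the decode and encode steps swapped.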
(See libgpl for the complete source, including support routines and build files.)