utf_util
- Unicode Transformation Format (UTF) Utilities
The UTF utilities are used to convert between 8-, 16-, and 32-bit Unicode Transformation Format strings. The Unicode character space is 21 bits wide. Consequently, the 8-bit encoding (UTF-8) requires from 1 to 4 bytes to represent a Unicode character. The 16-bit encoding (UTF-16) requires one or two 16-bit code units. The 32-bit encoding (UTF-32), of course, can represent any Unicode character as is.
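The size rules above can be illustrated with a minimal sketch. The function names `sketch_utf8_encode()` and `sketch_utf16_encode()` are hypothetical, chosen for this example only; they are not the UTF_UTIL functions, and the package's own signatures may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of utf_util): encode one code point as
   UTF-8.  Returns the number of bytes written (1 to 4), or 0 if the
   value is a surrogate or lies beyond U+10FFFF. */
size_t sketch_utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                        /* 7 bits: 1 byte */
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 11 bits: 2 bytes */
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 16 bits: 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 21 bits: 4 bytes */
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper (not part of utf_util): encode one code point as
   UTF-16.  Returns the number of 16-bit code units written (1 or 2),
   or 0 if the value is not encodable. */
size_t sketch_utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    if (cp < 0x10000) {                     /* fits in one code unit */
        out[0] = (uint16_t) cp;
        return 1;
    }
    if (cp <= 0x10FFFF) {                   /* needs a surrogate pair */
        cp -= 0x10000;
        out[0] = (uint16_t) (0xD800 | (cp >> 10));
        out[1] = (uint16_t) (0xDC00 | (cp & 0x3FF));
        return 2;
    }
    return 0;
}
```

For example, U+20AC (the euro sign) takes three bytes in UTF-8 (E2 82 AC) and one code unit in UTF-16, while U+1F600 takes four bytes in UTF-8 and the surrogate pair D83D DE00 in UTF-16.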
UTF-16 and UTF-32 encodings arrange bytes in most significant to least significant (big-endian) order or least significant to most significant (little-endian) order. A byte-order marker (BOM) may be inserted at the beginning of an encoded string to indicate the byte order; if there is no BOM, big-endian order is assumed.
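As a rough illustration of the byte-order rule just described, the following sketch inspects the first two bytes of a UTF-16 buffer and falls back to big-endian when no BOM is present. The type `SketchByteOrder` and both function names are assumptions made for this example; they are not the package's `utf16bom()` interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type and helpers, not part of utf_util. */
typedef enum { SKETCH_BIG_ENDIAN, SKETCH_LITTLE_ENDIAN } SketchByteOrder;

/* Determine the byte order of a UTF-16 byte buffer.  *bom_len receives
   the number of bytes occupied by the BOM (2 or 0). */
SketchByteOrder sketch_utf16_byte_order(const unsigned char *buf, size_t len,
                                        size_t *bom_len)
{
    assert(buf != NULL && bom_len != NULL);
    *bom_len = 0;
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF) {     /* U+FEFF, big-endian */
            *bom_len = 2;
            return SKETCH_BIG_ENDIAN;
        }
        if (buf[0] == 0xFF && buf[1] == 0xFE) {     /* U+FEFF, little-endian */
            *bom_len = 2;
            return SKETCH_LITTLE_ENDIAN;
        }
    }
    return SKETCH_BIG_ENDIAN;                       /* no BOM: assume big-endian */
}

/* Assemble one 16-bit code unit from two bytes in the given order. */
uint16_t sketch_utf16_read_unit(const unsigned char *p, SketchByteOrder order)
{
    assert(p != NULL);
    if (order == SKETCH_BIG_ENDIAN)
        return (uint16_t) ((p[0] << 8) | p[1]);
    return (uint16_t) ((p[1] << 8) | p[0]);
}
```

A caller would skip `bom_len` bytes and then read the remaining code units with the detected order.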
UTF-8 strings need no BOM, but some applications look for a BOM at the beginning of a file to detect that the file is UTF-8 rather than plain ASCII/Latin.
The conversion functions automatically append a null terminator to converted strings, provided there is enough room in the destination buffer: 0x00 for UTF-8, 0x0000 for UTF-16, and 0x00000000 for UTF-32 strings.
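The terminator rule can be sketched as follows. Both functions are hypothetical and exist only for this example. To keep the sketch short, the converter assumes 7-bit ASCII input, which is a simplification and not how the package's conversion functions are specified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of utf_util): count the code units in a
   UTF-16 string, stopping at the 0x0000 terminator or after max_units
   code units, whichever comes first. */
size_t sketch_utf16_nlen(const uint16_t *s, size_t max_units)
{
    size_t n = 0;
    assert(s != NULL);
    while (n < max_units && s[n] != 0x0000)
        n++;
    return n;
}

/* Hypothetical helper (not part of utf_util): widen a 7-bit ASCII string
   to UTF-16 code units.  A 0x0000 terminator is appended only if there
   is room left in the destination.  Returns the number of code units
   written, not counting the terminator. */
size_t sketch_ascii_to_utf16(const char *src, uint16_t *dst, size_t dst_units)
{
    size_t n = 0;
    assert(src != NULL && dst != NULL);
    while (src[n] != '\0' && n < dst_units) {
        dst[n] = (uint16_t) (unsigned char) src[n];
        n++;
    }
    if (n < dst_units)
        dst[n] = 0x0000;                /* terminator fits */
    return n;
}
```

Because a full destination buffer is left unterminated, a length-limited count such as `sketch_utf16_nlen()` is the safe way to measure the result.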
The UTF_UTIL package was written as a small, self-contained set of straightforward functions intended for converting Unicode strings in MP3 ID3v2 tags. As such, portability was of higher concern than efficiency. If you're doing a lot of Unicode string handling, you might wish to research some of the full-featured Unicode libraries.
utf16bom()
- returns a UTF-16 byte-order marker indication.
utf32bom()
- returns a UTF-32 byte-order marker indication.
utf16len()
- returns the number of code units in a UTF-16 string.
utf16nlen()
- returns the number of code units in a length-limited UTF-16 string.
utf32len()
- returns the number of code units in a UTF-32 string.
utf32nlen()
- returns the number of code units in a length-limited UTF-32 string.
utf8get()
- decodes a code point from a UTF-8 string.
utf8put()
- encodes a code point to a UTF-8 string.
utf16get()
- decodes a code point from a UTF-16 string.
utf16put()
- encodes a code point to a UTF-16 string.
utf32get()
- decodes a code point from a UTF-32 string.
utf32put()
- encodes a code point to a UTF-32 string.
utf8to16()
- converts a UTF-8 string to a UTF-16 string.
utf8to32()
- converts a UTF-8 string to a UTF-32 string.
utf16to8()
- converts a UTF-16 string to a UTF-8 string.
utf32to8()
- converts a UTF-32 string to a UTF-8 string.
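To show how the get and conversion functions listed above relate, here is a minimal sketch of decoding UTF-8 one code point at a time and using that to build a UTF-32 string. The names `sketch_utf8_decode()` and `sketch_utf8_to_utf32()` are hypothetical; the actual `utf8get()` and `utf8to32()` signatures are defined in utf_util.h and may look quite different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of utf_util): decode one code point from
   at most len bytes.  Returns the number of bytes consumed, or 0 if the
   sequence is truncated, malformed, overlong, or out of range. */
size_t sketch_utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t need, i;
    uint32_t value, min;

    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        need = 2; value = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        need = 3; value = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        need = 4; value = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                       /* stray continuation or bad lead byte */
    }
    if (len < need)
        return 0;
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                   /* not a continuation byte */
        value = (value << 6) | (s[i] & 0x3F);
    }
    if (value < min || value > 0x10FFFF ||
        (value >= 0xD800 && value <= 0xDFFF))
        return 0;                       /* overlong, too large, or surrogate */
    *cp = value;
    return need;
}

/* Hypothetical helper (not part of utf_util): convert a NUL-terminated
   UTF-8 string to UTF-32.  Stops at the first invalid sequence.  Appends
   a 0x00000000 terminator if room remains.  Returns the number of code
   units written, not counting the terminator. */
size_t sketch_utf8_to_utf32(const char *src, uint32_t *dst, size_t dst_units)
{
    const unsigned char *p = (const unsigned char *) src;
    size_t remaining, n = 0;

    assert(src != NULL && dst != NULL);
    remaining = strlen(src);
    while (remaining > 0 && n < dst_units) {
        uint32_t cp;
        size_t used = sketch_utf8_decode(p, remaining, &cp);
        if (used == 0)
            break;
        dst[n++] = cp;
        p += used;
        remaining -= used;
    }
    if (n < dst_units)
        dst[n] = 0;
    return n;
}
```

Stopping at the first invalid sequence is one design choice among several; substituting U+FFFD and continuing is another common one.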
utf_util.c
utf_util.h
(See libgpl for the complete source, including support routines and build files.)