13. Libraries
Programming languages have either no or only basic support for Unicode. Libraries are required to get full Unicode support on all platforms.
13.1. Qt library
Qt is a big C++ library covering different topics, but it is typically used to create graphical interfaces. It is distributed under the GNU LGPL license (version 2.1), and is also available under a commercial license.
13.1.1. Character and string classes
QChar is a Unicode character, only able to store BMP characters. It is implemented using a 16-bit unsigned number. Interesting QChar methods:
isSpace(): True if the character category is separator (Zl, Zp or Zs)
toUpper(): convert to upper case
QString is a character string implemented as an array of QChar using UTF-16. A non-BMP character is stored as two QChar (a surrogate pair). Interesting QString methods:
toAscii(), fromAscii(): encode to/decode from ASCII
toLatin1(), fromLatin1(): encode to/decode from ISO 8859-1
utf16(), fromUtf16(): encode to/decode from UTF-16 (in the host byte order)
normalized(): normalize to NFC, NFD, NFKC or NFKD
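The UTF-16 storage model described above can be illustrated without Qt. The following is a minimal Python sketch, not Qt code: it lists the 16-bit code units that an array of QChar would have to hold, in host byte order. The helper name utf16_code_units is made up for this example, and the normalization line uses Python's unicodedata module as a stand-in for normalized().

```python
import sys
import unicodedata


def utf16_code_units(text):
    """Return the 16-bit UTF-16 code units of text (hypothetical helper)."""
    # "utf-16-le" and "utf-16-be" do not emit a BOM, unlike plain "utf-16".
    codec = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
    data = text.encode(codec)
    return [int.from_bytes(data[i:i + 2], sys.byteorder)
            for i in range(0, len(data), 2)]


# A BMP character (U+00E9) fits in a single code unit.
print([hex(unit) for unit in utf16_code_units("\u00e9")])
# prints ['0xe9']

# A non-BMP character (U+1F600) needs two code units: a surrogate pair.
print([hex(unit) for unit in utf16_code_units("\U0001F600")])
# prints ['0xd83d', '0xde00']

# NFD splits U+00E9 into 'e' followed by U+0301; NFC recomposes it.
decomposed = unicodedata.normalize("NFD", "\u00e9")
print(len(decomposed), unicodedata.normalize("NFC", decomposed) == "\u00e9")
# prints 2 True
```

The second result shows why the length of a UTF-16 string counted in code units can differ from its length counted in characters.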
Qt decodes literal byte strings from ISO 8859-1 using the QLatin1String class, a thin wrapper around char*. QLatin1String is a character string storing each character as a single byte. This is possible because it only supports characters in the U+0000-U+00FF range. QLatin1String cannot be used to manipulate text: it has a smaller API than QString. For example, it is not possible to concatenate two QLatin1String strings.
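The one-byte-per-character property of ISO 8859-1 can be checked with a few lines of Python. This is only an illustration of the encoding itself, assuming Python's "latin-1" codec; it does not use QLatin1String, and the helper name to_latin1 is made up for this example.

```python
def to_latin1(text):
    """Encode text to ISO 8859-1; raises for characters above U+00FF."""
    return text.encode("latin-1")


# Each character in the U+0000-U+00FF range becomes exactly one byte.
encoded = to_latin1("caf\u00e9")
print(len(encoded))  # prints 4

# Decoding maps every byte value to the code point with the same number.
decoded = bytes(range(256)).decode("latin-1")
print(all(ord(char) == index for index, char in enumerate(decoded)))
# prints True

# A character outside the range, such as U+20AC (EURO SIGN), is rejected.
try:
    to_latin1("\u20ac")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)
```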
13.1.2. Codec
QTextCodec.codecForLocale() gets the locale encoding codec:
Windows: ANSI code page
Otherwise: the locale encoding. Try nl_langinfo(CODESET), or the LC_ALL, LC_CTYPE and LANG environment variables. If none of them gives useful information, fall back to ISO 8859-1.
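The non-Windows lookup order described above can be sketched in Python. This is a simplified illustration, not Qt's implementation: the function names are hypothetical, the result of nl_langinfo(CODESET) is passed in as a parameter so the logic can be exercised on any platform, and locale names are assumed to look like "language_TERRITORY.encoding@modifier".

```python
import locale
import os


def encoding_from_env(environ):
    """Extract an encoding from LC_ALL, LC_CTYPE or LANG (hypothetical helper)."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name, "")
        # Assumed shape: "fr_FR.UTF-8" or "de_DE.ISO-8859-15@euro".
        if "." in value:
            encoding = value.split(".", 1)[1].split("@", 1)[0]
            if encoding:
                return encoding
    return None


def guess_locale_encoding(codeset, environ):
    """Follow the order: codeset, then environment, then ISO 8859-1."""
    if codeset:
        return codeset
    return encoding_from_env(environ) or "ISO-8859-1"


# nl_langinfo() does not exist on every platform, so guard the call.
if hasattr(locale, "nl_langinfo"):
    system_codeset = locale.nl_langinfo(locale.CODESET)
else:
    system_codeset = ""
print(guess_locale_encoding(system_codeset, os.environ))
```

Passing an empty codeset forces the environment variable path, for example guess_locale_encoding("", {"LANG": "fr_FR.ISO-8859-15@euro"}) gives "ISO-8859-15".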
13.1.3. Filesystem
QFile.encodeName(): encode a filename to a byte string usable with native APIs. QFile.decodeName() is the reverse operation.
Qt has two implementations of its QFSFileEngine:
Windows: use the Windows native API
UNIX: use the POSIX API. Examples: fopen(), getcwd() or get_current_dir_name(), mkdir(), etc.
Related classes: QFile, QFileInfo, QAbstractFileEngineHandler, QFSFileEngine.
13.2. The glib library
The glib library is a great C library distributed under the GNU LGPL license (version 2.1).
13.2.1. Character strings
The gunichar type is a character. It is able to store any Unicode 6.0 character (U+0000-U+10FFFF).
The glib library has no character string type. It uses byte strings with the gchar* type, but most functions use UTF-8 encoded strings.
13.2.2. Codec functions
g_convert(): decode from an encoding and encode to another encoding with the iconv library. Use g_convert_with_fallback() to choose how to handle undecodable bytes and unencodable characters.
g_locale_from_utf8() / g_locale_to_utf8(): encode to/decode from the current locale encoding.
g_get_charset(): get the locale encoding: the current ANSI code page on Windows; the current code page (call DosQueryCp()) on OS/2; otherwise try nl_langinfo(CODESET), or the LC_ALL, LC_CTYPE or LANG environment variables
g_utf8_get_char(): get the first character of a UTF-8 string as a gunichar
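To make the last item concrete, here is a minimal Python sketch of how the first code point can be extracted from a UTF-8 byte string. It is an illustration of the decoding step only, not glib's code: the function name utf8_first_char is hypothetical, and the sketch does not reject overlong sequences or surrogate code points.

```python
def utf8_first_char(data):
    """Return (code point, byte length) of the first UTF-8 character."""
    if not data:
        raise ValueError("empty byte string")
    first = data[0]
    if first < 0x80:                 # 0xxxxxxx: ASCII, one byte
        return first, 1
    if first >> 5 == 0b110:          # 110xxxxx: two-byte sequence
        length, code = 2, first & 0x1F
    elif first >> 4 == 0b1110:       # 1110xxxx: three-byte sequence
        length, code = 3, first & 0x0F
    elif first >> 3 == 0b11110:      # 11110xxx: four-byte sequence
        length, code = 4, first & 0x07
    else:
        raise ValueError("invalid UTF-8 lead byte")
    if len(data) < length:
        raise ValueError("truncated UTF-8 sequence")
    for byte in data[1:length]:
        if byte >> 6 != 0b10:        # continuation bytes are 10xxxxxx
            raise ValueError("invalid UTF-8 continuation byte")
        code = (code << 6) | (byte & 0x3F)
    return code, length


print(utf8_first_char(b"\xc3\xa9abc"))   # prints (233, 2), i.e. U+00E9
print(utf8_first_char("\U0001F600".encode("utf-8")))  # prints (128512, 4)
```

Returning the byte length alongside the code point lets a caller step through the string one character at a time.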
13.2.3. Filename functions
g_filename_from_utf8() / g_filename_to_utf8(): encode a filename from UTF-8 / decode a filename to UTF-8
g_filename_display_name(): human-readable version of a filename. Try to decode the filename from each encoding of the g_get_filename_charsets() list. If all decodings fail, decode the filename from UTF-8 and replace undecodable bytes with � (U+FFFD).
g_get_filename_charsets(): get the list of charsets used to decode and encode filenames. g_filename_display_name() tries each encoding of this list; other functions just use the first encoding. Use UTF-8 on Windows. On other operating systems, use:
the G_FILENAME_ENCODING environment variable (if set): comma-separated list of character set names, where the special token "@locale" means the locale encoding
or the locale encoding (g_get_charset()) if the G_BROKEN_FILENAMES environment variable is set
or UTF-8, followed by the locale encoding, if neither variable is set
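The "try each charset, then fall back to UTF-8 with replacement" behaviour of g_filename_display_name() described above can be sketched in Python. This is an approximation written for illustration, not glib's implementation: both function names are hypothetical, and the charset list is passed in explicitly instead of being read from the environment.

```python
def parse_filename_encoding(value, locale_encoding):
    """Split a G_FILENAME_ENCODING-style value into a list of charsets."""
    charsets = []
    for token in value.split(","):
        token = token.strip()
        if token:
            # The special token "@locale" stands for the locale encoding.
            charsets.append(locale_encoding if token == "@locale" else token)
    return charsets


def display_name(raw_name, charsets):
    """Decode a filename for display, never failing."""
    for charset in charsets:
        try:
            return raw_name.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue  # undecodable bytes or unknown charset: try the next one
    # Last resort: UTF-8, with undecodable bytes replaced by U+FFFD.
    return raw_name.decode("utf-8", errors="replace")


charsets = parse_filename_encoding("UTF-8,@locale", "ISO-8859-1")
print(charsets)  # prints ['UTF-8', 'ISO-8859-1']

# b"caf\xe9" is not valid UTF-8, so the second charset is used.
print(display_name(b"caf\xe9", charsets) == "caf\u00e9")  # prints True

# With only ASCII in the list, the last-resort path inserts U+FFFD.
print(display_name(b"caf\xe9", ["ascii"]) == "caf\ufffd")  # prints True
```

Because the last step never raises, the result is always displayable, but it must not be used to open the file: the replacement is lossy.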
13.3. iconv library
libiconv is a library to encode and decode text in different encodings. It is distributed under the GNU LGPL license. It supports many encodings, including rare and old ones.
By default, libiconv is strict: an unencodable character raises an error. You can ignore these characters by adding the //IGNORE suffix to the encoding name. There is also the //TRANSLIT suffix to replace unencodable characters with similar-looking characters.
PHP has a built-in binding to iconv.
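The three behaviours (strict, //IGNORE, //TRANSLIT) can be imitated with Python's codec error handlers. This is only an analogy: Python's standard library has no iconv binding, so the sketch below does not call libiconv. The crude_translit helper is hypothetical and covers far fewer characters than iconv's transliteration tables; it only strips combining marks after a compatibility decomposition.

```python
import unicodedata

text = "na\u00efve caf\u00e9"

# Strict, like iconv without a suffix: an unencodable character is an error.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print("strict:", exc.reason)

# Comparable to //IGNORE: unencodable characters are silently dropped.
ignored = text.encode("ascii", errors="ignore")
print(ignored)  # prints b'nave caf'


def crude_translit(value):
    """Very rough stand-in for //TRANSLIT (hypothetical helper)."""
    # NFKD splits U+00E9 into 'e' + U+0301; the combining mark is then dropped.
    return unicodedata.normalize("NFKD", value).encode("ascii", errors="ignore")


print(crude_translit(text))  # prints b'naive cafe'
```

Dropping characters loses information silently, so the strict default is usually the safer choice when the output must round-trip.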
13.4. ICU libraries
International Components for Unicode (ICU) is a mature, widely used set of C, C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is an open source project distributed under the MIT license.
13.5. libunistring
libunistring provides functions for manipulating Unicode strings and for manipulating C strings according to the Unicode standard. It is distributed under the GNU LGPL license version 3.