Unicode encodings
=================

.. index:: UTF-8
.. _utf8:
.. _UTF-8:

UTF-8
-----

UTF-8 is a multibyte encoding able to encode the whole Unicode charset. An
encoded character takes between 1 and 4 bytes. UTF-8 was originally designed to
support longer byte sequences, up to 6 bytes, but the biggest code point of
Unicode 6.0 (U+10FFFF) only takes 4 bytes, and RFC 3629 restricts UTF-8 to
4 bytes accordingly.

.. TODO:: NELLE - I don't understand. Why would UTF-8 support sequences of
   5 bytes or more if they are useless?

It is possible to check reliably whether a :ref:`byte string ` is encoded in
UTF-8, because UTF-8 adds markers to each byte of a multibyte character. In the
first byte, bit 7 and bit 6 are set (``0b11xxxxxx``); in the following bytes,
bit 7 is set and bit 6 is unset (``0b10xxxxxx``). Another useful feature of
UTF-8 is that it has no endianness: it is a sequence of single bytes, so the
byte order of the host does not matter. Another advantage of UTF-8 is that most
:ref:`C ` byte string functions are compatible with UTF-8 encoded strings (e.g.
:c:func:`strcat` or :c:func:`printf`), whereas they fail with UTF-16 and UTF-32
encoded strings, because these encodings encode small code points with nul
bytes.

.. todo:: write a section: handle NUL byte/character

The drawback of UTF-8, compared to :ref:`ASCII` or :ref:`ISO-8859-1`, is that
it is a multibyte encoding: you cannot access a character directly by its
index. You have to iterate over the characters, because each character may
have a different length in bytes. If getting a character by its index is a
common operation in your program, use a :ref:`character string ` instead of a
:ref:`UTF-8 encoded string `.

.. TODO:: NELLE - the first sentence does not seem grammatically "correct" to
   me: "It is possible to be sure that a byte string is encoded by UTF-8,
   because UTF-8 adds markers to each byte." => "Thanks to markers placed at
   each byte, it is possible to make sure a byte string is encoded in UTF-8"

.. TODO:: NELLE - "The problem with"

.. TODO:: NELLE - It seems to me that you use "endianness" without having
   explained what it is beforehand. Do you assume that the reader knows it?

.. TODO:: NELLE - "If getting": from that point on, I no longer follow.

.. seealso::

   :ref:`Non-strict UTF-8 decoder ` and :ref:`Is UTF-8? `.


.. index:: UCS-2, UCS-4, UTF-16, UTF-32
.. _ucs2:
.. _ucs4:
.. _utf16:
.. _utf32:

UCS-2, UCS-4, UTF-16 and UTF-32
-------------------------------

**UCS-2** and **UCS-4** encodings :ref:`encode ` each code point to exactly one
unit of, respectively, 16 and 32 bits. UCS-4 is able to encode all Unicode 6.0
code points, whereas UCS-2 is limited to :ref:`BMP ` characters. These
encodings are practical because the length in units is the number of
characters.

**UTF-16** and **UTF-32** encodings use, respectively, 16-bit and 32-bit units.
UTF-16 encodes code points bigger than U+FFFF using two units: a
:ref:`surrogate pair `. UCS-2 can be :ref:`decoded ` from UTF-16. UTF-32 needs
only one unit per code point, because 32 bits are enough to store all code
points of Unicode 6.0 (up to U+10FFFF). That's why UTF-32 and UCS-4 are the
same encoding in practice.

+----------+-----------+-----------------+
| Encoding | Word size | Unicode support |
+==========+===========+=================+
| UCS-2    | 16 bits   | BMP only        |
+----------+-----------+-----------------+
| UTF-16   | 16 bits   | Full            |
+----------+-----------+-----------------+
| UCS-4    | 32 bits   | Full            |
+----------+-----------+-----------------+
| UTF-32   | 32 bits   | Full            |
+----------+-----------+-----------------+

:ref:`Windows 95 ` uses UCS-2, whereas Windows 2000 uses UTF-16.

.. note::

   UCS stands for *Universal Character Set*, and UTF stands for *UCS
   Transformation Format*.


.. index:: UTF-7
.. _utf7:

UTF-7
-----

The UTF-7 encoding is similar to the :ref:`UTF-8 encoding `, except that it
uses 7-bit units instead of 8-bit units. It is used, for example, in emails
sent through servers which are not "8-bit clean".


.. index:: BOM
.. _bom:

Byte order marks (BOM)
----------------------

:ref:`UTF-16 ` and :ref:`UTF-32 ` use units bigger than 8 bits, and so are
sensitive to endianness. A single unit can be stored as big endian (most
significant byte first) or little endian (least significant byte first). A BOM
is a short byte sequence that indicates the encoding and the byte order. It is
the U+FEFF code point encoded with the given UTF encoding. Unicode defines
6 different BOMs:

+----------------------------------------+--------------------------+---------------+
| BOM                                    | Encoding                 | Endian        |
+========================================+==========================+===============+
| ``0x2B 0x2F 0x76 0x38 0x2D`` (5 bytes) | :ref:`UTF-7 `            | *endianless*  |
+----------------------------------------+--------------------------+---------------+
| ``0xEF 0xBB 0xBF`` (3 bytes)           | :ref:`UTF-8 `            | *endianless*  |
+----------------------------------------+--------------------------+---------------+
| ``0xFF 0xFE`` (2 bytes)                | :ref:`UTF-16-LE `        | little endian |
+----------------------------------------+--------------------------+---------------+
| ``0xFE 0xFF`` (2 bytes)                | :ref:`UTF-16-BE `        | big endian    |
+----------------------------------------+--------------------------+---------------+
| ``0xFF 0xFE 0x00 0x00`` (4 bytes)      | :ref:`UTF-32-LE `        | little endian |
+----------------------------------------+--------------------------+---------------+
| ``0x00 0x00 0xFE 0xFF`` (4 bytes)      | :ref:`UTF-32-BE `        | big endian    |
+----------------------------------------+--------------------------+---------------+

The UTF-32-LE BOM starts with the UTF-16-LE BOM, so a decoder has to check for
the UTF-32-LE BOM first.

"UTF-16" and "UTF-32" encoding names are imprecise: depending on the context,
format or protocol, they mean UTF-16 and UTF-32 with a BOM, or UTF-16 and
UTF-32 in the host byte order without a BOM. On Windows, "UTF-16" usually means
UTF-16-LE.

Some Windows applications, like notepad.exe, use the UTF-8 BOM, whereas many
applications are unable to detect the BOM, and so the BOM causes trouble.
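
To illustrate how the BOM table above can be used, here is a minimal sketch of
a BOM detector. The function name ``detect_bom`` and its signature are
hypothetical (they do not come from any library); the byte sequences are the
ones listed in the table. Longer BOMs are tested first, because the UTF-32-LE
BOM starts with the UTF-16-LE BOM.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (illustration only): return the name of the encoding
 * announced by a BOM at the start of data, or NULL if none of the BOMs of
 * the table is found. *bom_len receives the BOM length in bytes (0 if no
 * BOM is found). */
const char *detect_bom(const unsigned char *data, size_t size,
                       size_t *bom_len)
{
    static const struct {
        const char *name;
        size_t len;
        unsigned char bytes[5];
    } boms[] = {
        /* UTF-32-LE is tested before UTF-16-LE: both start with FF FE */
        {"UTF-32-LE", 4, {0xFF, 0xFE, 0x00, 0x00}},
        {"UTF-32-BE", 4, {0x00, 0x00, 0xFE, 0xFF}},
        {"UTF-7",     5, {0x2B, 0x2F, 0x76, 0x38, 0x2D}},
        {"UTF-8",     3, {0xEF, 0xBB, 0xBF}},
        {"UTF-16-LE", 2, {0xFF, 0xFE}},
        {"UTF-16-BE", 2, {0xFE, 0xFF}},
    };
    size_t i;

    assert(data != NULL && bom_len != NULL);
    for (i = 0; i < sizeof(boms) / sizeof(boms[0]); i++) {
        if (size >= boms[i].len
            && memcmp(data, boms[i].bytes, boms[i].len) == 0) {
            *bom_len = boms[i].len;
            return boms[i].name;
        }
    }
    *bom_len = 0;
    return NULL;
}
```

This sketch cannot remove every ambiguity: a UTF-16-LE text whose first
character after the BOM is U+0000 starts with the same four bytes as the
UTF-32-LE BOM, and a text without BOM simply gives ``NULL``.
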
For better interoperability, the UTF-8 BOM should not be used.

.. todo:: which troubles?


.. index:: Surrogate pair
.. _surrogates:

UTF-16 surrogate pairs
----------------------

Surrogates are characters in the Unicode range U+D800–U+DFFF (2,048 code
points): it is also the :ref:`Unicode category ` "surrogate" (Cs). The range is
composed of two parts:

* U+D800–U+DBFF (1,024 code points): high surrogates
* U+DC00–U+DFFF (1,024 code points): low surrogates

In :ref:`UTF-16 `, characters in ranges U+0000–U+D7FF and U+E000–U+FFFF are
stored as a single 16-bit unit. :ref:`Non-BMP ` characters (range
U+10000–U+10FFFF) are stored as "surrogate pairs", two 16-bit units: a high
surrogate (in range U+D800–U+DBFF) followed by a low surrogate (in range
U+DC00–U+DFFF). A lone surrogate character is invalid in UTF-16: surrogate
characters are always written as pairs (high followed by low).

Examples of surrogate pairs:

+-----------+------------------+
| Character | Surrogate pair   |
+===========+==================+
| U+10000   | {U+D800, U+DC00} |
+-----------+------------------+
| U+10E6D   | {U+D803, U+DE6D} |
+-----------+------------------+
| U+1D11E   | {U+D834, U+DD1E} |
+-----------+------------------+
| U+10FFFF  | {U+DBFF, U+DFFF} |
+-----------+------------------+

.. note::

   U+10FFFF is the highest code point encodable to UTF-16 and the highest code
   point of the :ref:`Unicode Character Set ` 6.0. The {U+DBFF, U+DFFF}
   surrogate pair is the last available pair.

A :ref:`UTF-8 ` or :ref:`UTF-32 ` encoder should not encode surrogate
characters (U+D800–U+DFFF), see :ref:`Non-strict UTF-8 decoder `.

.. highlight:: c

:ref:`C ` functions to create a surrogate pair (:ref:`encode ` to UTF-16) and
to join a surrogate pair (:ref:`decode ` from UTF-16)::

    #include <assert.h>
    #include <stdint.h>

    /* Split a non-BMP character into a surrogate pair:
     * units[0] is the high surrogate, units[1] is the low surrogate */
    void
    encode_utf16_pair(uint32_t character, uint16_t *units)
    {
        uint32_t code;
        assert(0x10000 <= character && character <= 0x10FFFF);
        code = (character - 0x10000);
        /* 10 high bits in the first unit, 10 low bits in the second */
        units[0] = (uint16_t)(0xD800 | (code >> 10));
        units[1] = (uint16_t)(0xDC00 | (code & 0x3FF));
    }

    /* Join a surrogate pair (high, low) into a non-BMP character */
    uint32_t
    decode_utf16_pair(const uint16_t *units)
    {
        uint32_t code;
        assert(0xD800 <= units[0] && units[0] <= 0xDBFF);
        assert(0xDC00 <= units[1] && units[1] <= 0xDFFF);
        code = 0x10000;
        code += (uint32_t)(units[0] & 0x03FF) << 10;
        code += (uint32_t)(units[1] & 0x03FF);
        return code;
    }
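
The two functions above only handle a surrogate pair and abort on any other
input. As a minimal sketch of how the rules of this section fit together, the
following functions handle any single code point and report errors through
their return value instead of an assertion. The names ``encode_utf16`` and
``decode_utf16`` and their signatures are hypothetical, chosen for this
example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point to UTF-16. Write 1 or 2 units
 * into units[] and return the number of units written. Return 0 if the code
 * point cannot be encoded (surrogate, or bigger than U+10FFFF). */
size_t encode_utf16(uint32_t character, uint16_t units[2])
{
    assert(units != NULL);
    if (0xD800 <= character && character <= 0xDFFF)
        return 0;   /* surrogate characters must not be encoded */
    if (character <= 0xFFFF) {
        units[0] = (uint16_t)character;   /* BMP: a single unit */
        return 1;
    }
    if (character <= 0x10FFFF) {
        uint32_t code = character - 0x10000;
        units[0] = (uint16_t)(0xD800 | (code >> 10));
        units[1] = (uint16_t)(0xDC00 | (code & 0x3FF));
        return 2;
    }
    return 0;
}

/* Hypothetical helper: decode the first code point of a UTF-16 buffer of
 * size units. Return the number of units consumed, or 0 on an empty buffer,
 * a lone surrogate or a low surrogate without a high surrogate before it. */
size_t decode_utf16(const uint16_t *units, size_t size, uint32_t *character)
{
    assert(units != NULL && character != NULL);
    if (size == 0)
        return 0;
    if (units[0] < 0xD800 || 0xDFFF < units[0]) {
        *character = units[0];            /* BMP: a single unit */
        return 1;
    }
    /* a high surrogate must be followed by a low surrogate */
    if (units[0] <= 0xDBFF && size >= 2
        && 0xDC00 <= units[1] && units[1] <= 0xDFFF) {
        *character = 0x10000
            + (((uint32_t)units[0] & 0x3FF) << 10)
            + ((uint32_t)units[1] & 0x3FF);
        return 2;
    }
    return 0;
}
```

Returning 0 leaves the error policy (stop, skip the unit, or substitute a
replacement character) to the caller, which is a design choice of this sketch
and not something prescribed by UTF-16.
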