Unicode is the best and hopefully last character set.
ASCII uses 7 bits (usually stored in an 8-bit byte) to encode 128 characters (x00-x7F) and was fine for basic English. Character sets that fully utilize the 8th bit can encode up to 256 characters (the extra 128 occupy x80-xFF) and are sufficient for many European languages. However, it can be onerous in a multi-language scenario to deal with multiple character sets. This is where Unicode comes in.
Unicode is an MBCS (Multi-Byte Character Set) that uses 2-4 bytes worth of possible code points. Unicode was developed by the Unicode Consortium (unicode.org). Unicode is the sensible international and cross-platform character set.
ISO 10646 defines the UCS (Universal Character Set) used by Unicode. UCS was designed to be a superset of all other character sets. ISO 10646-1 was first published in 1993. ISO 10646-2 was published in 2001 and added characters outside of the BMP (Basic Multilingual Plane).
Unicode 1.1 corresponds to ISO 10646-1:1993, Unicode 3.0 corresponds to ISO 10646-1:2000, and Unicode 3.2 adds ISO 10646-2:2001. All Unicode versions since 2.0 are compatible: only new characters will be added, and no existing characters will be removed or renamed in the future. In general, Unicode is more comprehensive than ISO 10646.
As an aside, most of the code points are for complete characters (EG: A = U+0041) or pre-composed characters (EG: Ä = U+00C4). However, some of the code points are for combining characters, which work similarly to dead keys (See Shortcuts) or non-spacing accent keys on a typewriter. EG: Ä = A + a combining diaeresis = U+0041 U+0308.
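The relationship between the pre-composed and combining spellings of Ä can be checked with Python's standard `unicodedata` module. This is only a minimal sketch; the helper name `code_points` is made up for illustration.

```python
import unicodedata

precomposed = "\u00C4"       # Ä as a single pre-composed code point
combining = "\u0041\u0308"   # A followed by COMBINING DIAERESIS


def code_points(text):
    """Return the code points of text in U+XXXX notation (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


# The two spellings are different code point sequences ...
assert precomposed != combining

# ... but Unicode normalization converts one into the other.
composed = unicodedata.normalize("NFC", combining)      # compose
decomposed = unicodedata.normalize("NFD", precomposed)  # decompose

print(code_points(composed))    # ['U+00C4']
print(code_points(decomposed))  # ['U+0041', 'U+0308']
```

Comparing text that may mix both spellings is usually safer after normalizing both sides to the same form.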
Microsoft started supporting Unicode with Windows 2000 and SQL Server 7. Microsoft defaults to UCS-2 (stored little-endian, as in UTF-16LE). UTF-8 is probably the best Unicode encoding scheme for Linux and Unix.
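To make the difference between the two encoding schemes concrete, the sketch below encodes the same text both ways using Python's built-in codecs. The helper `hex_bytes` is a hypothetical name used only here.

```python
def hex_bytes(data):
    """Render bytes as space-separated uppercase hex (hypothetical helper)."""
    return " ".join(f"{b:02X}" for b in data)


yen = "\u00A5"  # the Yen symbol

# UTF-16LE always spends two bytes on a BMP character, low byte first.
print(hex_bytes(yen.encode("utf-16-le")))  # A5 00
# UTF-8 spends one to four bytes depending on the code point.
print(hex_bytes(yen.encode("utf-8")))      # C2 A5

# Plain ASCII text doubles in size under UTF-16LE but not under UTF-8.
word = "Unicode"
print(len(word.encode("utf-16-le")), len(word.encode("utf-8")))  # 14 7
```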
UCS-2 (the 2-byte Universal Character Set, aka the BMP or Plane 0) uses 16 bits (x0 to xFFFF = 0 to 65,535) for the code points. UCS-2 was formed when Unicode Standard 1.1 and ISO 10646-1 merged together.
UCS-4 uses 31 bits (x0 to x7FFFFFFF = 0 to 2,147,483,647) for code points but is not as popular. The characters beyond the range of UCS-2 were added in ISO 10646-2.
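The sizes of the two code spaces, and the idea that the BMP is Plane 0, can be verified with a little arithmetic. In this sketch the constant and function names (`UCS2_MAX`, `UCS4_MAX`, `plane`, `in_bmp`) are illustrative and not taken from any standard API.

```python
UCS2_MAX = 0xFFFF       # highest 16-bit code point
UCS4_MAX = 0x7FFFFFFF   # highest 31-bit code point


def plane(cp):
    """Return the plane number: each plane holds 0x10000 code points."""
    return cp >> 16


def in_bmp(cp):
    """True if the code point lies in the BMP (Plane 0), i.e. fits UCS-2."""
    return plane(cp) == 0


print(UCS2_MAX + 1)     # 65536 code points (2**16)
print(UCS4_MAX + 1)     # 2147483648 code points (2**31)
print(in_bmp(0x00A5))   # True: the Yen symbol fits in 16 bits
print(plane(0x1D11E))   # 1: a musical symbol outside the BMP
```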
Each Unicode (UCS-2) code point is usually referenced by its hexadecimal value, and is put in this format: U+4DigitHexValue. EG: U+00A5 is the code point for the Yen symbol (¥). Related code elements form groups called scripts. The number space set aside for a script is called its code block. Code blocks usually start at some nice round hexadecimal number. EG:
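As a small illustration of the U+ notation, the sketch below formats a character's code point as four hexadecimal digits and looks up its official name with the standard `unicodedata` module. The function name `describe` is an assumption made for this example.

```python
import unicodedata


def describe(ch):
    """Format one character as 'U+XXXX NAME' (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe("\u00A5"))  # U+00A5 YEN SIGN
print(describe("A"))       # U+0041 LATIN CAPITAL LETTER A
```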
Here is approximately how the two bytes are distributed:
    General Scripts
    |       Symbols
    |       |   CJK Auxiliary                          Compatibility
    |       |   |   CJK Ideographs                  Private Use    |
    |       |   |   |                                         |    |
    V       V   V   V                                         V    V
    +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
    |***|*  |** |***|***|***|***|***|***|   |   |   |   |   | ##|##*|
    +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
    00  10  20  30  40  50  60  70  80  90  A0  B0  C0  D0  E0  F0 FE

    +---+            +---+           +---+
    |***| Allocated  |   | Reserved  |###| Private Use
    +---+            +---+           +---+
To review character set terminology:
Unicode has additional complexity:
A BOM (Byte Order Mark) may be inserted at the beginning of data streams. The Unicode character U+FEFF, aka ZWNBSP (Zero Width Non-Breaking SPace), is used as the BOM. Simply convert the character into its appropriate encoding scheme. When displayed on the system viewing this page, the FE and FF bytes are displayed as þ and ÿ, respectively. There are several reasons for using a BOM:
Unicode has the following encoding schemes:
UTF-7: The BOM is 2B 2F 76 38 2D.

UTF-8: The BOM is EF BB BF. A BOM is optional with UTF-8 because UTF-8 is byte-oriented and so has a fixed byte order. When displayed on the system viewing this page, the EF BB BF bytes are displayed as ï » ¿.

| UCS-4 Code Points | UTF-8 Bytes/Octets | # of Free Bits | # Code Points Expressible | Note |
|---|---|---|---|---|
| U-00000000 - U-0000007F | 0xxxxxxx | 7 | 2^7 = 128 | ASCII |
| U-00000080 - U-000007FF | 110xxxxx 10xxxxxx | 5+6 = 11 | 2^11 = 2,048 | ISO Latin 1 and more |
| U-00000800 - U-0000FFFF | 1110xxxx 10xxxxxx 10xxxxxx | 4+6+6 = 16 | 2^16 = 65,536 | Max of 3 bytes covers UCS-2 |
| U-00010000 - U-001FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 3+6+6+6 = 21 | 2^21 = 2,097,152 | |
| U-00200000 - U-03FFFFFF | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx | 2+6+6+6+6 = 26 | 2^26 = 67,108,864 | |
| U-04000000 - U-7FFFFFFF | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx | 1+6+6+6+6+6 = 31 | 2^31 = 2,147,483,648 | Max of 6 bytes covers UCS-4 |
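The bit layout in the table can be turned into a small encoder. The following is a sketch of the original 31-bit scheme exactly as tabulated, not a replacement for a library codec; `utf8_encode` is a hypothetical name. Note that current UTF-8 (RFC 3629) stops at U+10FFFF, so only the first four rows are used in practice.

```python
def utf8_encode(cp):
    """Encode one code point using the bit layout from the table (sketch)."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp <= 0x7F:
        return bytes([cp])  # 0xxxxxxx: plain ASCII
    # (upper limit of the row, lead-byte marker, number of continuation bytes)
    rows = (
        (0x7FF, 0xC0, 1),
        (0xFFFF, 0xE0, 2),
        (0x1FFFFF, 0xF0, 3),
        (0x3FFFFFF, 0xF8, 4),
        (0x7FFFFFFF, 0xFC, 5),
    )
    for limit, marker, n_cont in rows:
        if cp <= limit:
            # The lead byte carries the marker plus the topmost free bits.
            out = [marker | (cp >> (6 * n_cont))]
            # Each continuation byte is 10xxxxxx with the next six bits.
            for i in range(n_cont - 1, -1, -1):
                out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
            return bytes(out)
    raise ValueError("code point is outside the 31-bit UCS-4 range")


# Cross-check against Python's built-in codec for a few code points.
for cp in (0x41, 0xA5, 0xFEFF, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")

print(utf8_encode(0xFEFF).hex())  # efbbbf
```

The built-in codec refuses code points above U+10FFFF, so the five- and six-byte rows can only be exercised through the hand-written function.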
UTF-16BE: charset=unicodeFFFE (opposite of the actual BOM!). The required BOM is FE FF (þ ÿ).

UTF-16LE: charset=unicode. The required BOM is FF FE (ÿ þ).

UTF-32BE: The BOM is 00 00 FE FF.

UTF-32LE: The BOM is FF FE 00 00.

2007-07-28 02:27:45Z
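The BOM byte sequences listed for each encoding scheme can be reproduced by encoding U+FEFF with Python's built-in codecs, and a BOM can be sniffed from the front of a byte stream with the constants in the standard `codecs` module. This is a rough sketch; `bom_bytes` and `sniff_bom` are hypothetical helper names.

```python
import codecs

BOM = "\uFEFF"


def bom_bytes(encoding):
    """Encode U+FEFF and show the result as uppercase hex (hypothetical helper)."""
    return " ".join(f"{b:02X}" for b in BOM.encode(encoding))


for name in ("utf-7", "utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    print(name, bom_bytes(name))

# UTF-32 must be tested before UTF-16: FF FE 00 00 also starts with FF FE.
KNOWN_BOMS = (
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
)


def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, label in KNOWN_BOMS:
        if data.startswith(bom):
            return label
    return None


print(sniff_bom(b"\xFF\xFEA\x00"))  # UTF-16LE
```

A missing BOM does not prove anything about the encoding, so `None` here only means that no BOM was found.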