TAGS: ASCII, Character Set, Encoding, Programming, TECH, Text, Unicode, UTF
Character sets are standards that map specific real-world characters (code elements) to specific numbers (code points) and then encode those code points as bytes (character encodings, or encoding schemes).
A document consists of binary data. A document can explicitly state the character set for itself or for parts of itself (see Web Character Sets), but a viewer, such as a browser, can also override that declaration and decode the data using its own choice of character encoding. EG: These two bytes (xCE xB1) are displayed differently by different browsers, depending upon the character encoding chosen for the page by the viewer. In Firefox, go to the menu View > Character Encoding. In Internet Explorer, go to the menu View > Encoding.
- 1100111010110001. In binary.
- 206 177. In decimal.
- CE B1. In hexadecimal.
- α.

Once a set of bits has been mapped to a code point, that code point can be mapped to its code element (the character), and software can then present that character and apply different fonts, styles, and sizes. EG:
In the ASCII character map, the code element J has a code point of 74 (x4A). This code point is encoded as the binary number 1001010 (which is x4A in hexadecimal). The binary code can then be interpreted by programs such as browsers and presented to users:
| Formatting Applied | Example | Setting |
|---|---|---|
| Font | J | Comic |
| Font | J | Wingdings (normally a smiley face) |
| Style | J | underlined |
| Style | J | italic |
| Size | J | HTML Size 6 |
| Size | J | HTML Size 1 |
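
The round trip from bytes to code point to character can be tried directly. The following is a minimal Python sketch, not a definitive implementation; the helper names `notations` and `decode_with` are made up for this illustration. It writes the two example bytes in the three notations, decodes them under several encodings, and then looks up the code point for J.

```python
def notations(raw: bytes) -> dict:
    """Return the same bytes written in binary, decimal, and hexadecimal."""
    return {
        "binary": " ".join(f"{b:08b}" for b in raw),
        "decimal": " ".join(str(b) for b in raw),
        "hexadecimal": " ".join(f"{b:02X}" for b in raw),
    }


def decode_with(raw: bytes, encodings) -> dict:
    """Decode the same bytes once per encoding, as a viewer override would."""
    return {name: raw.decode(name) for name in encodings}


raw = bytes([0xCE, 0xB1])
print(notations(raw))

# The viewer's choice of encoding decides which characters appear.
for name, text in decode_with(raw, ["utf-8", "iso-8859-1", "iso-8859-7"]).items():
    print(f"{name}: {text}")

# The ASCII example: code element J <-> code point 74 (x4A).
code_point = ord("J")
print(code_point, f"x{code_point:X}", f"{code_point:b}")  # 74 x4A 1001010
print(chr(0x4A))  # J
```

Under UTF-8 the two bytes form the single character α; under iso-8859-1 each byte is treated as its own character (Î and ±).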
You may also want to see my articles on Typography.
Choosing a character set is important, especially if you have to deal with international code, multiple platforms, or databases. EG:
When SQL Server is installed, a Sort Order ID must be set, and that ID is based on a character set. If you try to restore a database to that installation, the database must have the same Sort Order ID. Otherwise you may have to rebuild the database, using something like rebuildm.exe.
Here are some of the major character sets:
iso-8859-1) is probably the most prevalent.

Here is a summary table of the major character sets:
| Character Set | Bits | Highest Code Point (Decimal) | Highest Code Point (Hexadecimal) |
|---|---|---|---|
| ASCII | 7/8 | 127 | 7F |
| High ASCII | 8 | 255 | FF |
| Unicode, UCS-2 | 16 | 65,535 | FFFF |
| UCS-4 | 31 | 2,147,483,647 | 7FFF FFFF |
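
The limits in the summary table follow from the bit widths: with n bits there are 2^n possible values, numbered from 0, so the highest code point is 2^n - 1. A small Python sketch of that arithmetic (the function name `highest_code_point` is only illustrative):

```python
def highest_code_point(bits: int) -> int:
    """Largest value that fits in the given number of bits (counting from 0)."""
    return 2 ** bits - 1


for name, bits in [("ASCII", 7), ("High ASCII", 8), ("UCS-2", 16), ("UCS-4", 31)]:
    top = highest_code_point(bits)
    # Print the decimal value with thousands separators, then the hexadecimal.
    print(f"{name:<10} {bits:>2} bits  {top:>13,}  {top:X}")
```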
Here are some languages and the character sets they commonly use. (All of them can also use Unicode.)
| Language | charset |
|---|---|
| Afrikaans (af) | iso-8859-1, windows-1252 |
| Albanian (sq) | iso-8859-1, windows-1252 |
| Arabic (ar) | iso-8859-6, windows-1256 |
| Basque (eu) | iso-8859-1, windows-1252 |
| Bulgarian (bg) | iso-8859-5 |
| Byelorussian (be) | iso-8859-5 |
| Catalan (ca) | iso-8859-1, windows-1252 |
| Croatian (hr) | iso-8859-2, windows-1250 |
| Czech (cs) | iso-8859-2, windows-1250 |
| Danish (da) | iso-8859-1, windows-1252 |
| Dutch (nl) | iso-8859-1, windows-1252 |
| English (en) | iso-8859-1, windows-1252 |
| Esperanto (eo) | iso-8859-3 |
| Estonian (et) | iso-8859-15 |
| Faroese (fo) | iso-8859-1, windows-1252 |
| Finnish (fi) | iso-8859-1, windows-1252 |
| French (fr) | iso-8859-1, windows-1252 |
| Galician (gl) | iso-8859-1, windows-1252 |
| German (de) | iso-8859-1, windows-1252 or 1250 |
| Greek (el) | iso-8859-7 |
| Hebrew (he, formerly iw) | iso-8859-8 |
| Hungarian (hu) | iso-8859-2, windows-1250 |
| Icelandic (is) | iso-8859-1, windows-1252 |
| Inuit (Eskimo) languages | iso-8859-10 |
| Irish (ga) | iso-8859-1, windows-1252 |
| Italian (it) | iso-8859-1, windows-1252 |
| Japanese (ja) | shift_jis, iso-2022-jp, euc-jp |
| Latvian (lv) | iso-8859-13, windows-1257 |
| Lithuanian (lt) | iso-8859-13, windows-1257 |
| Macedonian (mk) | iso-8859-5 |
| Maltese (mt) | iso-8859-3* |
| Norwegian (no) | iso-8859-1, windows-1252 |
| Polish (pl) | iso-8859-2, windows-1250 |
| Portuguese (pt) | iso-8859-1, windows-1252 |
| Romanian (ro) | iso-8859-2, windows-1250 |
| Russian (ru) | koi8-r, iso-8859-5 |
| Scottish Gaelic (gd) | iso-8859-1, windows-1252 |
| Serbian (sr) | iso-8859-5 |
| Slovak (sk) | iso-8859-2, windows-1250 |
| Slovenian (sl) | iso-8859-2, windows-1250 |
| Spanish (es) | iso-8859-1, windows-1252 |
| Swedish (sv) | iso-8859-1, windows-1252 |
| Turkish (tr) | iso-8859-9, windows-1254 |
| Ukrainian (uk) | iso-8859-5 |
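
Whether a charset suits a language comes down to whether it can encode that language's characters. The following Python sketch checks this with the standard codecs; the helper name `can_encode` and the sample words are assumptions chosen for illustration, not part of any standard.

```python
def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Sample words (illustrative only) paired with a charset from the table.
samples = [
    ("German", "Grüße", "iso-8859-1"),
    ("Polish", "Żółć", "iso-8859-2"),
    ("Greek", "αβγ", "iso-8859-7"),
    ("Russian", "Привет", "koi8-r"),
    ("Japanese", "日本語", "shift_jis"),
]

for language, word, charset in samples:
    size = len(word.encode(charset))
    print(f"{language}: {word} fits in {charset} using {size} bytes")
    # The same word always fits in Unicode (here encoded as UTF-8).
    print(f"  as utf-8: {len(word.encode('utf-8'))} bytes")

# A charset for the wrong script cannot represent the text at all.
print(can_encode("αβγ", "iso-8859-1"))  # False
```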
Here are some of the available code page identifiers used by Windows.
| Identifier | Meaning | OEM/ANSI | Comment |
|---|---|---|---|
| 037 | EBCDIC | | Used in mainframes, esp. IBM. |
| 437 | MS-DOS United States | OEM | IBM DOS and OS/2. Aka: IBM PC Extended Character Set; Extended ASCII; High ASCII; 437 U.S. English. |
| 500 | EBCDIC "500V1" | | |
| 708 | Arabic (ASMO 708) | OEM | |
| 709 | Arabic (ASMO 449+, BCON V4) | OEM | |
| 710 | Arabic (Transparent Arabic) | OEM | |
| 720 | Arabic (Transparent ASMO) | OEM | |
| 737 | Greek (formerly 437G) | OEM | |
| 775 | Baltic | OEM | |
| 850 | MS-DOS Multilingual (Latin I) | OEM | Standard MS DOS. Aka: 850 Multilingual. |
| 852 | MS-DOS Slavic (Latin II) | OEM | |
| 855 | IBM Cyrillic (primarily Russian) | OEM | |
| 857 | IBM Turkish | OEM | |
| 860 | MS-DOS Portuguese | OEM | |
| 861 | MS-DOS Icelandic | OEM | |
| 862 | Hebrew | OEM | |
| 863 | MS-DOS Canadian-French | OEM | |
| 864 | Arabic | OEM | |
| 865 | MS-DOS Nordic | OEM | |
| 866 | MS-DOS Russian | OEM | |
| 869 | IBM Modern Greek | OEM | |
| 874 | Thai | OEM/ANSI | |
| 875 | EBCDIC | | |
| 932 | Japanese | OEM/ANSI | Double byte. |
| 936 | Chinese (PRC, Singapore; Simplified) | OEM/ANSI | Double byte. |
| 949 | Korean | OEM/ANSI | Double byte. |
| 950 | Chinese (Taiwan; Hong Kong SAR, PRC; Traditional) | OEM/ANSI | Double byte. Most common variant: Chinese Traditional (Big5). Aka: big5; cn-big5; csbig5; x-x-big5. |
| 1026 | EBCDIC | | |
| 1200 | Unicode (BMP of ISO 10646) | ANSI | Windows NT/2000 and HTML. Aka: ISO-10646; UCS. |
| 1250 | Windows 3.1 Eastern European | ANSI | |
| 1251 | Windows 3.1 Cyrillic | ANSI | |
| 1252 | Windows 3.1 US (ANSI) | ANSI | Windows 3.x/9x, Macs, and HTML. ANSI comes in two versions; the difference is found at decimal 128-159 (hexadecimal 80-9F). |
| 1253 | Windows 3.1 Greek | ANSI | |
| 1254 | Windows 3.1 Turkish | ANSI | |
| 1255 | Hebrew | ANSI | |
| 1256 | Arabic | ANSI | |
| 1257 | Baltic | ANSI | |
| 1258 | Vietnamese | | |
| 1361 | Korean (Johab) | OEM | |
| 10000 | Macintosh Roman | | Used in every Mac on the planet. Covers most of the iso-8859-1 characters, but everything after the ASCII range is in a different order. Aka: x-mac-roman. |
| 10001 | Macintosh Japanese | | |
| 10006 | Macintosh Greek I | | |
| 10007 | Macintosh Cyrillic | | |
| 10029 | Macintosh Latin 2 | | |
| 10079 | Macintosh Icelandic | | |
| 10081 | Macintosh Turkish | | Aka: x-mac-turkish. |
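
Code pages agree on the ASCII range and diverge above it (EBCDIC differs even for plain letters). A short Python sketch, assuming Python's bundled codec names (`cp437`, `cp850`, `cp1252`, `mac_roman`, `cp037`) as stand-ins for the Windows identifiers; the `CODECS` mapping and the function name `decode_byte` are made up for this illustration.

```python
# Windows code page identifier -> Python codec name (mapping assumed for this sketch).
CODECS = {437: "cp437", 850: "cp850", 1252: "cp1252", 10000: "mac_roman"}


def decode_byte(value: int) -> dict:
    """Decode one byte value under each code page in CODECS."""
    return {page: bytes([value]).decode(codec) for page, codec in CODECS.items()}


# An ASCII byte means the same thing in all of these code pages.
print(decode_byte(0x4A))

# A byte above the ASCII range does not.
for page, char in decode_byte(0x80).items():
    print(f"code page {page}: x80 -> {char}")

# EBCDIC (code page 037) does not even share the ASCII layout.
print("A".encode("cp037").hex().upper())  # C1
```

This is why the 128-159 (x80-x9F) range is the usual place where text moved between systems turns into the wrong characters.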
Note that, in the code page table above, the alias shown in bold is the preferred charset ID for the HTML <meta> tag:
<meta http-equiv="Content-Type" content="text/html; charset=characterSet">
<!-- Used to explicitly state the character set used.
Examples of characterSet include windows-1252, iso-8859-1, and utf-8. -->
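
A viewer that honours the declaration has to find the charset in the raw bytes before it can decode them. The following Python sketch shows one rough way to do that; the regular expression, the function names `sniff_charset` and `decode_page`, the 1024-byte search window, and the `windows-1252` fallback are all assumptions for illustration, and real browsers follow a more involved detection procedure.

```python
import re

# Look for charset=NAME inside a <meta ...> tag; the bytes are still undecoded.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def sniff_charset(raw: bytes, default: str = "windows-1252") -> str:
    """Return the charset declared near the top of the page, or the fallback."""
    match = META_CHARSET.search(raw[:1024])
    return match.group(1).decode("ascii").lower() if match else default


def decode_page(raw: bytes) -> str:
    """Decode the page with its declared charset, replacing undecodable bytes."""
    return raw.decode(sniff_charset(raw), errors="replace")


page = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=utf-8"></head>'
    b"<body>\xce\xb1</body></html>"
)
print(sniff_charset(page))
print(decode_page(page))
```

The declaration itself is written in ASCII characters, which is what lets it be found before the encoding is known.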
Systems save EOLs (End Of Lines) and EOFs (End Of Files) differently.
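
For reference, the common conventions are a line feed (x0A) on Unix-like systems, a carriage return plus line feed (x0D x0A) on DOS and Windows, and a lone carriage return (x0D) on classic Mac OS; some old DOS text files also end with a Ctrl-Z byte (x1A) as an EOF marker. A minimal Python sketch for normalizing such files, with the function names `normalize_eol` and `strip_dos_eof` invented for this example:

```python
def normalize_eol(raw: bytes) -> bytes:
    """Convert CR LF and lone CR line endings to a single LF."""
    # Order matters: collapse CR LF first so its CR is not converted twice.
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def strip_dos_eof(raw: bytes) -> bytes:
    """Drop one trailing Ctrl-Z (x1A) end-of-file marker, if present."""
    return raw[:-1] if raw.endswith(b"\x1a") else raw


mixed = b"dos line\r\nold mac line\runix line\n\x1a"
print(normalize_eol(strip_dos_eof(mixed)))
```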
Page Modified: (Hand noted: 2006-06-07 12:29:31Z) (Auto noted: 2008-07-22 20:28:00Z)