Character sets are standards that map a specific set of real-world characters (code elements) to specific numbers (code points) and then encode those code points as bytes (character encodings, or encoding schemes). Older character sets corresponded to the characters commonly used in a specific language. Unicode is the modern character set, which tries to encompass most languages. A collation is a set of rules for comparing (esp. sorting) characters in a character set or database.
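
As a rough illustration of what a collation changes, the Python sketch below sorts the same words two ways: by raw code point, and with case folded first. The word list is made up, and str.casefold is only a stand-in for a real collation; a database collation applies far richer, language-specific rules.

```python
# Hypothetical sample data, chosen only to show the ordering difference.
words = ["banana", "Zebra", "apple", "Cherry"]

# The default sort compares code points, so every uppercase ASCII letter
# (65-90) sorts before every lowercase letter (97-122).
by_code_point = sorted(words)

# Folding case first approximates a case-insensitive collation.
case_insensitive = sorted(words, key=str.casefold)

print(by_code_point)     # ['Cherry', 'Zebra', 'apple', 'banana']
print(case_insensitive)  # ['apple', 'banana', 'Cherry', 'Zebra']
```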

A document consists of binary data. A document can explicitly state the character set for itself or parts of itself (see Web Character Sets), but a viewer, like a browser, can also "override" that and decode the data using its own choice of character encoding. EG: These two bytes (xCE xB1) will be displayed differently by different browsers depending upon the "Character Encoding" chosen for the page by the viewer. In Firefox, go to the menu for View > Character Encoding. In Internet Explorer, go to the menu for View > Encoding.
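
The effect of that choice can be reproduced outside a browser. The sketch below decodes the same two bytes with three encodings; the codec names are the ones Python's standard library uses, and the selection of encodings is just an assumption about what a viewer might pick.

```python
raw = b"\xce\xb1"  # the two bytes from the example above

# Python codec names for three encodings a viewer might choose.
for encoding in ("utf-8", "iso-8859-1", "windows-1252"):
    print(encoding, "->", raw.decode(encoding))

as_utf8 = raw.decode("utf-8")         # one character: Greek small letter alpha
as_latin1 = raw.decode("iso-8859-1")  # two characters: Î followed by ±
```

The bytes never change; only the interpretation does, which is why a wrong guess shows up as "garbage" characters rather than as an error.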

Once a set of bits has been mapped to a code point, that code point can be mapped to its code element (the character), and then software can present that character and apply different fonts, styles, and sizes. EG:

In the ASCII character map, the code element J has a code point of 74 (x4A). This code point is encoded as the binary number 1001010 (which is x4A in hexadecimal). The binary code can then be interpreted by programs such as browsers and presented to users:

Formatting Applied Example
Font J in Comic
Font J in Wingdings (normally a smiley face)
Style J underlined
Style J italic
Size J at HTML Size 6
Size J at HTML Size 1
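
The mapping for J described above (character, code point, bits) can be checked with a few lines of Python. This is only a sketch of the ASCII case; the variable names are arbitrary.

```python
char = "J"
code_point = ord(char)             # character -> code point

print(code_point)                  # 74
print(hex(code_point))             # 0x4a
print(format(code_point, "07b"))   # 1001010 (seven bits, as in ASCII)

encoded = char.encode("ascii")     # code point -> the single byte 0x4A
decoded = encoded.decode("ascii")  # byte -> character again
```

Fonts, styles, and sizes are applied after this step; they change how J is drawn, not the code point that is stored.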

You may also want to see my articles on Typography.

Choosing a character set is important, especially if you have to deal with international code, multiple platforms, or databases. EG:

When SQL Server is installed, a Sort Order ID must be chosen, and that ID is based on a character set. If you later try to restore a database to that installation, the database must have the same Sort Order ID. Otherwise you may have to rebuild the database, using something like rebuildm.exe.

Here is a summary table of the major character sets:

Character Set Bits Highest Value (Decimal) Highest Value (Hexadecimal)
ASCII 7/8 127 7F
High ASCII 8 255 FF
Unicode, UCS-2 16 65,535 FFFF
UCS-4 31 2,147,483,647 7FFF FFFF
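
The decimal and hexadecimal columns follow directly from the bit widths: with n bits there are 2^n possible values, numbered 0 through 2^n - 1. The sketch below recomputes the highest value for each row, taking ASCII as 7 bits (an assumption, since the table lists it as 7/8).

```python
# Bit widths as listed in the summary table (ASCII taken as 7 bits).
bit_widths = {"ASCII": 7, "High ASCII": 8, "UCS-2": 16, "UCS-4": 31}

# With n bits, values run from 0 to 2**n - 1, giving 2**n distinct values.
highest = {name: 2**bits - 1 for name, bits in bit_widths.items()}

for name, value in highest.items():
    print(f"{name:10} {bit_widths[name]:2} bits  {value:>13,}  {value:X}")
```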

Here are some languages and the ISO and Windows character sets they use. (And all of them can use Unicode.)

Language charset
Afrikaans (af) iso-8859-1, windows-1252
Albanian (sq) iso-8859-1, windows-1252
Arabic (ar) iso-8859-6, windows-1256
Basque (eu) iso-8859-1, windows-1252
Bulgarian (bg) iso-8859-5
Byelorussian (be) iso-8859-5
Catalan (ca) iso-8859-1, windows-1252
Croatian (hr) iso-8859-2, windows-1250
Czech (cs) iso-8859-2, windows-1250
Danish (da) iso-8859-1, windows-1252
Dutch (nl) iso-8859-1, windows-1252
English (en) iso-8859-1, windows-1252
Esperanto (eo) iso-8859-3
Estonian (et) iso-8859-15
Faroese (fo) iso-8859-1, windows-1252
Finnish (fi) iso-8859-1, windows-1252
French (fr) iso-8859-1, windows-1252
Galician (gl) iso-8859-1, windows-1252
German (de) iso-8859-1, windows-1252 or 1250
Greek (el) iso-8859-7
Hebrew (iw) iso-8859-8
Hungarian (hu) iso-8859-2, windows-1250
Icelandic (is) iso-8859-1, windows-1252
Inuit (Eskimo) languages iso-8859-10
Irish (ga) iso-8859-1, windows-1252
Italian (it) iso-8859-1, windows-1252
Japanese (ja) shift_jis, iso-2022-jp, euc-jp
Latvian (lv) iso-8859-13, windows-1257
Lithuanian (lt) iso-8859-13, windows-1257
Macedonian (mk) iso-8859-5
Maltese (mt) iso-8859-3*
Norwegian (no) iso-8859-1, windows-1252
Polish (pl) iso-8859-2, windows-1250
Portuguese (pt) iso-8859-1, windows-1252
Romanian (ro) iso-8859-2, windows-1250
Russian (ru) koi8-r, iso-8859-5
Scottish (gd) iso-8859-1, windows-1252
Serbian (sr) iso-8859-5
Slovak (sk) iso-8859-2, windows-1250
Slovenian (sl) iso-8859-2, windows-1250
Spanish (es) iso-8859-1, windows-1252
Swedish (sv) iso-8859-1, windows-1252
Turkish (tr) iso-8859-9, windows-1254
Ukrainian (uk) iso-8859-5
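
To see why the charset choice in this table matters, the sketch below encodes one Cyrillic letter with the two charsets listed for Russian and with UTF-8. The letter is an arbitrary pick, and the codec names are Python's spellings, which may differ slightly from the labels used elsewhere.

```python
letter = "Ж"  # CYRILLIC CAPITAL LETTER ZHE, U+0416

# Charsets from the Russian row of the table, plus UTF-8 for comparison.
encodings = ("koi8-r", "iso-8859-5", "utf-8")
encoded = {name: letter.encode(name) for name in encodings}

for name, data in encoded.items():
    print(f"{name:12} {data.hex()}")

# Decoding with the wrong charset does not fail here; it silently
# yields a different character.
misread = encoded["koi8-r"].decode("iso-8859-5")
```

Each legacy charset uses a single byte for the letter, but not the same byte, so text saved under one and read under the other comes out scrambled.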

Different companies refer to a character set by an identifier, often called a code page number. Here are some of the available code page identifiers used by Windows.

Identifier Meaning OEM/ANSI Comment
037 EBCDIC   Used in mainframes, esp. IBM.
437 MS-DOS United States OEM IBM DOS and OS/2. IBM PC Extended Character Set; Extended ASCII; High ASCII; 437 U.S. English.
500 EBCDIC "500V1"    
708 Arabic (ASMO 708) OEM  
709 Arabic (ASMO 449+, BCON V4) OEM  
710 Arabic (Transparent Arabic) OEM  
720 Arabic (Transparent ASMO) OEM  
737 Greek (formerly 437G) OEM  
775 Baltic OEM  
850 MS-DOS Multilingual (Latin I) OEM Standard MS DOS. 850 Multilingual.
852 MS-DOS Slavic (Latin II) OEM  
855 IBM Cyrillic/Russian OEM  
857 IBM Turkish OEM  
858 Multilingual. Like 850 but with the Euro symbol OEM  
860 MS-DOS Portuguese OEM  
861 MS-DOS Icelandic OEM  
862 Hebrew OEM  
863 MS-DOS Canadian-French OEM  
864 Arabic OEM  
865 MS-DOS Nordic OEM  
866 MS-DOS Cyrillic/Russian OEM  
869 IBM Modern Greek OEM  
874 Thai OEM/ANSI  
875 EBCDIC    
932 Japanese OEM/ANSI Double byte.
936 GBK, Chinese (PRC, Singapore; Simplified) OEM/ANSI Double byte.
949 Korean OEM/ANSI Double byte.
950 Chinese (Taiwan; Hong Kong SAR, PRC; Traditional) OEM/ANSI Double byte. Most common variant: Chinese Traditional (Big5); big5; cn-big5; csbig5; x-x-big5;
1026 EBCDIC    
1200 Unicode (BMP of ISO 10646); UCS-2LE; Unicode little-endian ANSI Windows NT/2000 and HTML. ISO-10646; UCS.
1201 UCS-2BE; Unicode big-endian ANSI  
1250 Windows 3.1 Eastern European ANSI  
1251 Windows 3.1 Cyrillic ANSI  
1252 Windows 3.1 US (ANSI) ANSI Windows 3.x/9x, Macs, and HTML.
"ANSI" comes in two versions (the difference is found at decimal 128-159 (hexadecimal 80-9F)):
  • Windows ANSI. Western European (Windows); windows-1252; US/Western European; Western.
  • ISO Latin 1 ANSI. Western European (ISO); iso-8859-1; ANSI_X3.4-1968; ANSI_X3.4-1986; ascii; cp367; cp819; csASCII; IBM367; ibm819; iso-ir-100; iso-ir-6; ISO646-US; iso8859-1; ISO_646.irv:1991; iso_8859-1; iso_8859-1:1987; latin1; us; us-ascii; x-ansi; iso-latin-1.
1253 Windows 3.1 Greek ANSI  
1254 Windows 3.1 Turkish ANSI  
1255 Hebrew ANSI  
1256 Arabic ANSI  
1257 Baltic ANSI  
1258 Vietnamese    
1361 Korean (Johab) OEM  
10000 Macintosh Roman   Used in every Mac on the planet. Covers most of iso-8859-1, but everything after the ASCII range is in a different order. x-mac-roman.
10001 Macintosh Japanese    
10006 Macintosh Greek I    
10007 Macintosh Cyrillic    
10029 Macintosh Latin 2, Central European    
10079 Macintosh Icelandic    
10081 Macintosh Turkish   x-mac-turkis.
65000 UTF-7 Unicode   utf-7; csUnicode11UTF7, unicode-1-1-utf-7, x-unicode-2-0-utf-7
65001 UTF-8 Unicode   The best! utf-8, unicode-1-1-utf-8, unicode-2-0-utf-8, x-unicode-2-0-utf-8.
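
Several of these code pages are available in Python under the name "cp" plus the identifier, which makes it easy to compare them. The sketch below reads the same byte under four of them, including the two "ANSI" versions that differ in the 128-159 (x80-x9F) range; the choice of byte is an assumption made for illustration.

```python
# The same byte read under several of the code pages listed above.
byte = b"\x80"

for codec in ("cp437", "cp850", "cp1252", "iso-8859-1"):
    print(codec, repr(byte.decode(codec)))

euro = byte.decode("cp1252")          # the Euro sign in windows-1252
control = byte.decode("iso-8859-1")   # a C1 control character in ISO Latin 1
quote = b"\x93".decode("cp1252")      # a curly opening double quote
```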

Note that an alias in bold is the preferred charset ID for the HTML <meta> tag:

<meta http-equiv="Content-Type" content="text/html; charset=characterSet">

<!-- Explicitly states the character set used by the page.
	Examples of characterSet include windows-1252, iso-8859-1, and utf-8. -->
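
A program reading such a page has to find that declaration before it can decode the rest. The sketch below shows one simplified way to do so in Python; the function name declared_charset, the 1024-byte search window, and the windows-1252 fallback are all assumptions for illustration, not how any particular browser works.

```python
import re

# Hypothetical page bytes: an ASCII-only <meta> tag followed by UTF-8 text.
page = (
    b'<meta http-equiv="Content-Type" '
    b'content="text/html; charset=utf-8">\n'
    b"<p>\xce\xb1</p>"
)

def declared_charset(data: bytes, default: str = "windows-1252") -> str:
    """Return the charset named in the first 1024 bytes, or the default."""
    match = re.search(rb"charset=[\"']?([A-Za-z0-9._-]+)", data[:1024])
    return match.group(1).decode("ascii") if match else default

charset = declared_charset(page)
text = page.decode(charset)
print(charset)  # utf-8
```

This works only because the tag itself is plain ASCII, which reads the same in nearly every charset in the tables above.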

Systems save EOLs (End Of Lines) and EOFs (End Of Files) differently.
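
For EOLs, the common conventions are LF, CR LF, and CR. The sketch below stores the same two lines each way and shows that the bytes differ even though the text is the same; the sample text is made up, and the platform labels in the comments reflect common usage rather than anything stated above.

```python
# One hypothetical text saved with three EOL conventions.
unix_text = b"line one\nline two\n"          # LF (Unix, Linux, macOS)
windows_text = b"line one\r\nline two\r\n"   # CR LF (DOS, Windows)
old_mac_text = b"line one\rline two\r"       # CR (classic Mac OS)

# The bytes differ, but splitlines() recognizes all three conventions.
variants = [unix_text, windows_text, old_mac_text]
lines = [v.decode("ascii").splitlines() for v in variants]
print(lines[0])  # ['line one', 'line two']
```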

Page Modified: (Hand noted: 2006-06-07 12:29:31Z) (Auto noted: 2010-04-02 15:03:28Z)