Character sets are standards that map specific real-world characters (code elements) to specific numbers (code points) and then encode those code points as bytes (character encodings or encoding schemes). EG:

A document consists of binary data. A document can explicitly state the character set for itself or parts of itself (see Web Character Sets), but a viewer, like a browser, can also "override" this and decode the data using its own choice of character encoding. EG: These two bytes (xCE xB1) will be displayed differently by different browsers depending upon the character encoding chosen for the page by the viewer. In Firefox, go to the menu for View > Character Encoding. In Internet Explorer, go to the menu for View > Encoding.
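As a minimal sketch of the same effect outside a browser, the Python snippet below decodes those two bytes under three encodings. The choice of encodings and the variable names are illustrative assumptions.

```python
# The two bytes from the example above.
raw = bytes([0xCE, 0xB1])

# Decode the same bytes under three different character encodings.
decoded = {
    name: raw.decode(name)
    for name in ("utf-8", "windows-1252", "iso-8859-7")
}

for name, text in decoded.items():
    print(f"{name:>12}: {text}")
```

Under utf-8 the pair is one character (α); under windows-1252 and iso-8859-7 it is two characters (Î± and Ξ± respectively).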

Once a set of bits has been mapped to a code point, that code point can be mapped to its code element (character), and software can then present that character and apply different fonts, styles, and sizes. EG:

In the ASCII character map, the code element J has a code point of 74 (x4A). This code point is encoded as the binary number 1001010 (which is x4A in hexadecimal). The binary code can then be interpreted by programs such as browsers and presented to users:
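The mapping for J can be checked with a short Python sketch; the variable names are only illustrative.

```python
char = "J"

code_point = ord(char)                 # character -> code point
as_hex = format(code_point, "X")       # code point in hexadecimal
as_binary = format(code_point, "b")    # code point in binary
encoded = char.encode("ascii")         # code point -> encoded byte
round_trip = encoded.decode("ascii")   # encoded byte -> character

print(code_point, as_hex, as_binary, encoded, round_trip)
```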

Formatting   Applied   Example
Font   Comic   J
Font   Wingdings (normally a smiley face)   J
Style   underlined   J
Style   italic   J
Size   HTML Size 6   J
Size   HTML Size 1   J

You may also want to see my articles on Typography.

Choosing a character set is important, especially if you have to deal with international code, multiple platforms, or databases. EG:

When SQL Server is installed, a Sort Order ID must be set that is based on a character set. If you try to restore a database to that installation, the database must have the same Sort Order ID. Otherwise you may have to rebuild the database, using something like rebuildm.exe.

Here is a summary table of the major character sets:

Character Set   Bits   Highest Value (Decimal)   Highest Value (Hexadecimal)
ASCII   7/8   127   7F
High ASCII   8   255   FF
Unicode, UCS-2   16   65,535   FFFF
UCS-4   31   2,147,483,647   7FFF FFFF
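The decimal and hexadecimal columns follow from the bit widths: n bits give 2^n distinct values, numbered 0 through 2^n - 1, so the highest value is one less than the count. A minimal Python sketch, using the bit widths from the table (the helper name max_value is an assumption made for illustration):

```python
# Bit widths taken from the summary table.
bit_widths = {"ASCII": 7, "High ASCII": 8, "UCS-2": 16, "UCS-4": 31}


def max_value(bits: int) -> int:
    """Largest unsigned integer representable in `bits` bits."""
    return (1 << bits) - 1


for name, bits in bit_widths.items():
    top = max_value(bits)
    print(f"{name:<10} {bits:>2} bits  {top:>13,}  {top:X}")
```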

Here are some languages and the ISO and Windows character sets they use. (All of them can also use Unicode.)

Language   charset 
 Afrikaans (af)   iso-8859-1, windows-1252 
 Albanian (sq)   iso-8859-1, windows-1252 
 Arabic (ar)   iso-8859-6, windows-1256
 Basque (eu)   iso-8859-1, windows-1252 
 Bulgarian (bg)   iso-8859-5 
 Byelorussian (be)   iso-8859-5 
 Catalan (ca)   iso-8859-1, windows-1252 
 Croatian (hr)   iso-8859-2, windows-1250
 Czech (cs)   iso-8859-2, windows-1250
 Danish (da)   iso-8859-1, windows-1252 
 Dutch (nl)   iso-8859-1, windows-1252 
 English (en)   iso-8859-1, windows-1252 
 Esperanto (eo)   iso-8859-3
 Estonian (et)   iso-8859-15 
 Faroese (fo)   iso-8859-1, windows-1252 
 Finnish (fi)   iso-8859-1, windows-1252 
 French (fr)   iso-8859-1, windows-1252 
 Galician (gl)   iso-8859-1, windows-1252 
 German (de)   iso-8859-1, windows-1252 or 1250
 Greek (el)   iso-8859-7 
 Hebrew (iw)   iso-8859-8 
 Hungarian (hu)   iso-8859-2, windows-1250
 Icelandic (is)   iso-8859-1, windows-1252 
 Inuit (Eskimo) languages   iso-8859-10
 Irish (ga)   iso-8859-1, windows-1252 
 Italian (it)   iso-8859-1, windows-1252 
 Japanese (ja)   shift_jis, iso-2022-jp, euc-jp 
 Latvian (lv)   iso-8859-13, windows-1257 
 Lithuanian (lt)   iso-8859-13, windows-1257 
 Macedonian (mk)   iso-8859-5 
 Maltese (mt)   iso-8859-3* 
 Norwegian (no)   iso-8859-1, windows-1252 
 Polish (pl)   iso-8859-2, windows-1250
 Portuguese (pt)   iso-8859-1, windows-1252 
 Romanian (ro)   iso-8859-2, windows-1250
 Russian (ru)   koi8-r, iso-8859-5 
 Scottish (gd)   iso-8859-1, windows-1252 
 Serbian (sr)   iso-8859-5 
 Slovak (sk)   iso-8859-2, windows-1250
 Slovenian (sl)   iso-8859-2, windows-1250
 Spanish (es)   iso-8859-1, windows-1252 
 Swedish (sv)   iso-8859-1, windows-1252 
 Turkish (tr)   iso-8859-9, windows-1254 
 Ukrainian (uk)   iso-8859-5   
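Where a language lists both an ISO and a Windows charset, the two are not interchangeable: the same character can land on a different byte in each. A minimal Python sketch, assuming the Polish letter ą as the sample character (the sample and the variable names are illustrative):

```python
text = "ą"  # LATIN SMALL LETTER A WITH OGONEK, used in Polish

# Encode the same character in both legacy charsets and in UTF-8.
encoded = {
    name: text.encode(name)
    for name in ("iso-8859-2", "windows-1250", "utf-8")
}

for name, data in encoded.items():
    print(f"{name:>12}: {data.hex()}")

# Decoding iso-8859-2 bytes as windows-1250 silently yields a different character.
mismatch = encoded["iso-8859-2"].decode("windows-1250")
print(mismatch)
```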

Here are some of the available code page identifiers used by Windows.

Identifier Meaning OEM/ANSI Comment
037 EBCDIC   Used in mainframes, esp. IBM.
437 MS-DOS United States OEM IBM DOS and OS/2.
Aka: IBM PC Extended Character Set; Extended ASCII; High ASCII; 437 U.S. English.
500 EBCDIC "500V1"    
708 Arabic (ASMO 708) OEM  
709 Arabic (ASMO 449+, BCON V4) OEM  
710 Arabic (Transparent Arabic) OEM  
720 Arabic (Transparent ASMO) OEM  
737 Greek (formerly 437G) OEM  
775 Baltic OEM  
850 MS-DOS Multilingual (Latin I) OEM Standard MS DOS.
Aka: 850 Multilingual.
852 MS-DOS Slavic (Latin II) OEM  
855 IBM Cyrillic (primarily Russian) OEM  
857 IBM Turkish OEM  
860 MS-DOS Portuguese OEM  
861 MS-DOS Icelandic OEM  
862 Hebrew OEM  
863 MS-DOS Canadian-French OEM  
864 Arabic OEM  
865 MS-DOS Nordic OEM  
866 MS-DOS Russian OEM  
869 IBM Modern Greek OEM  
874 Thai OEM/ANSI  
875 EBCDIC    
932 Japanese OEM/ANSI Double byte.
936 Chinese (PRC, Singapore; Simplified) OEM/ANSI Double byte.
949 Korean OEM/ANSI Double byte.
950 Chinese (Taiwan; Hong Kong SAR, PRC; Traditional) OEM/ANSI Double byte. Most common variant: Chinese Traditional (Big5); big5; cn-big5; csbig5; x-x-big5;
1026 EBCDIC    
1200 Unicode (BMP of ISO 10646) ANSI Windows NT/2000 and HTML.
Aka: ISO-10646; UCS;
  • Unicode (UTF-7): utf-7; csUnicode11UTF7, unicode-1-1-utf-7, x-unicode-2-0-utf-7; 65000.
  • Unicode (UTF-8): utf-8; unicode-1-1-utf-8, unicode-2-0-utf-8, x-unicode-2-0-utf-8; 65001.
1250 Windows 3.1 Eastern European ANSI  
1251 Windows 3.1 Cyrillic ANSI  
1252 Windows 3.1 US (ANSI) ANSI Windows 3.x/9x, Macs, and HTML.
ANSI comes in two versions (the difference is found at decimal 128-159 (hexadecimal 80-9F)):
  • Windows ANSI. Aka: Western European (Windows); windows-1252; US/Western European; Western.
  • ISO Latin 1 ANSI. Aka: Western European (ISO); iso-8859-1; ANSI_X3.4-1968; ANSI_X3.4-1986; ascii; cp367; cp819; csASCII; IBM367; ibm819; iso-ir-100; iso-ir-6; ISO646-US; iso8859-1; ISO_646.irv:1991; iso_8859-1; iso_8859-1:1987; latin1; us; us-ascii; x-ansi; iso-latin-1.
1253 Windows 3.1 Greek ANSI  
1254 Windows 3.1 Turkish ANSI  
1255 Hebrew ANSI  
1256 Arabic ANSI  
1257 Baltic ANSI  
1258 Vietnamese    
1361 Korean (Johab) OEM  
10000 Macintosh Roman   Used in every Mac on the planet. A superset of ASCII that covers most of the same characters as iso-8859-1, but everything after the ASCII range is in a different order. Aka: x-mac-roman.
10001 Macintosh Japanese    
10006 Macintosh Greek I    
10007 Macintosh Cyrillic    
10029 Macintosh Latin 2    
10079 Macintosh Icelandic    
10081 Macintosh Turkish   Aka: x-mac-turkish.
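To see why the code page matters, the Python sketch below decodes one byte above the ASCII range under several of the code pages listed. The mapping from identifiers to Python codec names (cp437, cp850, cp1252, mac_roman) is an assumption about Python's standard codecs, not part of the Windows table itself.

```python
byte = b"\x82"

# Python codec names assumed to correspond to code pages 437, 850, 1252 and 10000.
codecs_by_page = {437: "cp437", 850: "cp850", 1252: "cp1252", 10000: "mac_roman"}
decoded = {page: byte.decode(codec) for page, codec in codecs_by_page.items()}

for page, char in decoded.items():
    print(f"{page:>5}: {char}")

# The two "ANSI" variants differ only in the range x80-x9F.
euro_cp1252 = b"\x80".decode("cp1252")     # a printable character
c1_latin1 = b"\x80".decode("iso-8859-1")   # a C1 control character
```

The same byte comes out as é under 437 and 850, as a low quotation mark under 1252, and as Ç under Macintosh Roman.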

Note that the aliases in bold are the preferred charset IDs for the HTML <meta> tag:

<meta http-equiv="Content-Type" content="text/html; charset=characterSet">
    <!-- Used to explicitly state the character set used.
         Examples of characterSet include windows-1252, iso-8859-1, and utf-8. -->
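A program that reads such a page can look for the declared charset before decoding the bytes. The Python sketch below is only an illustration: the function name declared_charset, the 1024-byte search window, and the windows-1252 fallback are all assumptions, and real browsers follow a more involved detection procedure.

```python
import re


def declared_charset(html_bytes: bytes, default: str = "windows-1252") -> str:
    """Return the charset named in a <meta> tag, or `default` if none is found."""
    # Charset names are plain ASCII, so the search can run on the raw bytes.
    match = re.search(
        rb"""charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
        html_bytes[:1024],  # hypothetical limit: only look near the top of the page
        re.IGNORECASE,
    )
    return match.group(1).decode("ascii").lower() if match else default


page = (
    b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    b"<p>\xce\xb1</p>"
)
charset = declared_charset(page)
text = page.decode(charset)
print(charset, text)
```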

Systems save EOLs (End Of Lines) and EOFs (End Of Files) differently.
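As a rough illustration of the EOL half of that statement, the Python sketch below uses the three common conventions (LF on Unix, CR LF on Windows, CR on classic Mac OS) and shows that str.splitlines() accepts all of them. The sample strings are made up for the example, and EOF markers are not covered here.

```python
# The same two lines of text, saved with three common EOL conventions.
samples = {
    "Unix (LF)": b"one\ntwo\n",
    "Windows (CR LF)": b"one\r\ntwo\r\n",
    "Classic Mac OS (CR)": b"one\rtwo\r",
}

# splitlines() recognises LF, CR LF and CR as line boundaries.
lines = {name: data.decode("ascii").splitlines() for name, data in samples.items()}

for name, data in samples.items():
    print(f"{name:<20} {data.hex(' '):<30} {lines[name]}")
```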

2006-06-07 12:29:31Z