This page explores how character sets pertain to the web and how to administer and choose character sets for web pages. See also Storing HTML in Databases.

Character Set Terminology

Before discussing character sets as they pertain to web pages, a few terms must be clarified:

Administering Character Sets

In a single-language situation, the language settings are set once and forgotten. However, in a multi-language situation, the settings may have to be customized as pages are made.

There are different places where, and different reasons why, a person might use different language settings.

When a web document reaches a user, the browser sequentially checks the following to decide which character set it will use to translate the bytes received into characters:

  1. The HTTP content type returned by the server.
  2. The character set specified by the Content-Type <meta> tag in the document's <head>.
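
The two checks above can be sketched in Python. This is only an illustration of the order of the checks, not how any particular browser is implemented. The function names charset_from_content_type and pick_charset, the 1024-byte scan window, and the windows-1252 fallback are all assumptions made for the example.

```python
import re
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Return the charset parameter found in a Content-Type value, if any."""
    match = re.search(r'charset\s*=\s*["\']?([\w.:-]+)', value, re.IGNORECASE)
    return match.group(1).lower() if match else None


def pick_charset(http_content_type: Optional[str],
                 html_bytes: bytes,
                 fallback: str = "windows-1252") -> str:
    """Check the HTTP header first, then the Content-Type <meta> tag."""
    # 1. The HTTP content type returned by the server.
    if http_content_type:
        charset = charset_from_content_type(http_content_type)
        if charset:
            return charset

    # 2. The Content-Type <meta> tag. The start of the document is read
    #    as ASCII, since the tag itself is written in ASCII characters.
    head = html_bytes[:1024].decode("ascii", errors="replace")
    for tag in re.findall(r"<meta\b[^>]*>", head, re.IGNORECASE):
        if re.search(r'http-equiv\s*=\s*["\']?content-type', tag, re.IGNORECASE):
            charset = charset_from_content_type(tag)
            if charset:
                return charset

    # Neither source named a character set (hypothetical fallback).
    return fallback


page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=iso-8859-1"></head></html>')
print(pick_charset("text/html; charset=UTF-8", page))
print(pick_charset("text/html", page))
```

The first call is decided by the HTTP header even though the page's <meta> tag names a different character set. The second call falls through to the <meta> tag because the header carries no charset parameter.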

Choosing Character Sets

When deciding on a character set, take the following into account:

  1. If a site or page is written predominantly in a particular natural language, choose the character set appropriate to that language. EG: iso-8859-3 is an 8-bit character set for Esperanto.
  2. If a site or page has multiple languages with considerably different characters, choose Unicode (probably utf-8).
  3. If a page has a few characters that are not part of the character set for the current page, then those characters can be entered as their equivalents using either NCRs (Numeric Character References) or CERs (Character Entity References). CERs use symbolic names so that authors need not remember code points. EG: For the character a with a grave accent (à), the NCR is either &#224; or &#xE0;, while the CER is &agrave;.
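
As a small illustration of the third point, the sketch below uses Python's standard html module to show that the three references resolve to the same character, and that a character missing from a character set can be written as an NCR. Plain ASCII stands in here for "the character set of the current page"; that choice and the sample word are assumptions for the example.

```python
import html

# The decimal NCR, the hexadecimal NCR, and the CER all name U+00E0.
ncr_decimal = html.unescape("&#224;")
ncr_hex = html.unescape("&#xE0;")
cer = html.unescape("&agrave;")
print(ncr_decimal, ncr_hex, cer)

# Encoding with "xmlcharrefreplace" writes any character that the target
# character set lacks as a decimal NCR instead of failing.
ascii_safe = "voilà".encode("ascii", errors="xmlcharrefreplace")
print(ascii_safe)
```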

One would think that windows-1252 would be a fine default character set because it is a superset of the printable characters of iso-8859-1 (it assigns printable characters to code points 128-159 (0x80-0x9F), which iso-8859-1 reserves for control characters) and you can use NCRs or CERs for other Unicode characters. However, those extra characters are not portable. EG: Entering the euro sign (€) on a windows-1252 system will enter the byte 128 (0x80). If the viewer uses a character set other than windows-1252, then that byte may be interpreted as something altogether different.
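
The euro example can be reproduced with Python's built-in codecs. This is a minimal sketch; the variable names are made up for the illustration.

```python
# The euro sign is a single byte, 0x80, in windows-1252.
euro_byte = "€".encode("windows-1252")
print(euro_byte)

# A reader assuming iso-8859-1 sees an invisible control character (U+0080).
as_latin1 = euro_byte.decode("iso-8859-1")
print(repr(as_latin1))

# A reader assuming utf-8 sees an invalid byte; with errors="replace"
# it is shown as the replacement character U+FFFD.
as_utf8 = euro_byte.decode("utf-8", errors="replace")
print(repr(as_utf8))
```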

One would also think that specifying iso-8859-1 as the character set would be fine. However, the future is globalization, and that means using a Unicode character set. A particular problem is that although Unicode is a superset of iso-8859-1 in terms of code points, the character encoding may be different! EG: For the Yen character (¥), both iso-8859-1 and Unicode have the same code point of 165 = 0xA5. However, iso-8859-1 encodes the Yen as the single byte 165 = 0xA5 = 10100101, while utf-8 encodes it as the two bytes 11000010 10100101 = C2 A5. If a utf-8 reader came across the lone iso-8859-1 byte for the Yen, it would reject it as invalid, because 10100101 is a utf-8 continuation byte with no lead byte before it.
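
To show where C2 A5 comes from, here is a hand-rolled sketch of the two-byte utf-8 layout (110xxxxx 10xxxxxx), checked against Python's own encoder. The helper name utf8_two_byte is hypothetical, and it only covers code points 128 through 2047.

```python
def utf8_two_byte(code_point: int) -> bytes:
    """Encode a code point in the range 0x80-0x7FF as two utf-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only code points 0x80-0x7FF use the two-byte form")
    lead = 0b11000000 | (code_point >> 6)                # 110xxxxx: top 5 bits
    continuation = 0b10000000 | (code_point & 0b111111)  # 10xxxxxx: low 6 bits
    return bytes([lead, continuation])


yen = "¥"
latin1_bytes = yen.encode("iso-8859-1")
utf8_bytes = utf8_two_byte(ord(yen))

print(format(latin1_bytes[0], "08b"))
print(" ".join(format(b, "08b") for b in utf8_bytes))

# The lone iso-8859-1 byte is not valid utf-8.
try:
    latin1_bytes.decode("utf-8")
    lone_byte_is_valid = True
except UnicodeDecodeError:
    lone_byte_is_valid = False
print(lone_byte_is_valid)
```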

My opinion is that all pages should use utf-8 as the character set.

Character References

Characters are often encoded in plain text documents such as HTML and XML using a character reference, especially characters that are difficult to enter via a keyboard or that have particular meaning in the syntax. Character references are either NCRs (Numeric Character References) or CERs (Character Entity References). CERs use symbolic names so that authors need not remember code points. EG: For the Yen character (¥), the NCR is either &#165; or &#xA5;, while the CER is &yen;.
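
Going the other way, the references for a given character can be generated from its code point. The sketch below uses the HTML 4 entity table in Python's html.entities module. The function name character_references and the dictionary keys are made up for the illustration.

```python
from html.entities import codepoint2name


def character_references(char: str) -> dict:
    """Build the NCRs for one character, plus the CER when HTML 4 defines one."""
    code_point = ord(char)
    references = {
        "ncr_decimal": f"&#{code_point};",
        "ncr_hex": f"&#x{code_point:X};",
    }
    name = codepoint2name.get(code_point)  # symbolic name, if one exists
    if name:
        references["cer"] = f"&{name};"
    return references


print(character_references("¥"))
print(character_references("中"))
```

Every character has NCRs, but only characters with a defined symbolic name have a CER, which is why the second call returns no "cer" entry.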

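HTML has 252 CERs, while XHTML has 253 CERs, because &apos; (') is defined in XML and XHTML but not in HTML. XML has 5 CERs, called Predefined CERs (PCERs). For XML and XHTML, the 5 PCERs must be used outside of the tags/elements/entities. Here are the 5 PCERs:

  1. &amp; for the ampersand (&)
  2. &lt; for the less-than sign (<)
  3. &gt; for the greater-than sign (>)
  4. &quot; for the double quotation mark (")
  5. &apos; for the apostrophe (')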
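
As a sketch of applying the five predefined references (&amp;, &lt;, &gt;, &quot;, &apos;) to text, the example below uses escape and unescape from Python's xml.sax.saxutils. Those functions handle &, < and > by default; the two quote mappings are passed in explicitly. The wrapper names escape_pcers and unescape_pcers and the sample sentence are assumptions for the illustration.

```python
from xml.sax.saxutils import escape, unescape

# escape() covers &, < and > on its own; the quotes are added here.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}
QUOTE_CHARACTERS = {"&quot;": '"', "&apos;": "'"}


def escape_pcers(text: str) -> str:
    """Replace the five special characters with their predefined references."""
    return escape(text, QUOTE_ENTITIES)


def unescape_pcers(text: str) -> str:
    """Turn the five predefined references back into literal characters."""
    return unescape(text, QUOTE_CHARACTERS)


raw = 'Tom & Jerry say "1 < 2" and it\'s > 0'
escaped = escape_pcers(raw)
print(escaped)
print(unescape_pcers(escaped) == raw)
```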

See also CERs in HTML [w3.org/TR/html401/sgml/entities.html], SYMBOL Characters and Glyphs [w3.org/Math/characters/html/symbol.html], and List of XML and HTML character entity references [W].

CERs are case sensitive. EG: &agrave; is à, while &Agrave; is À.
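
Case sensitivity can be checked against the HTML 4 entity table that ships with Python (html.entities.name2codepoint). This is a minimal sketch; the variable names are made up for the illustration.

```python
from html.entities import name2codepoint

# The same letters in a different case name a different character.
lower = chr(name2codepoint["agrave"])
upper = chr(name2codepoint["Agrave"])
print(lower, upper)

# A spelling that matches neither case exactly is not a defined name.
all_caps_defined = "AGRAVE" in name2codepoint
print(all_caps_defined)
```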

Page Modified: (Hand noted: 2007-07-29 01:44:14Z) (Auto noted: 2008-03-12 17:20:37Z)