This page explores how character sets pertain to the web and how to administer and choose character sets for web pages. See also Storing HTML in Databases.
Before discussing character sets as they pertain to web pages, a few terms must be clarified:
<meta http-equiv="Content-Type" content="text/html; charset=CharacterSet">. Note that email would use something like Content-Type: text/plain; charset="utf-8".
<meta http-equiv="Content-Language" content="SpellingLanguage">.
<span lang="SpellingLanguage">.
In a single-language situation, the language settings are set once and forgotten. However, in a multi-language situation, the settings may have to be customized as pages are made.
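As a rough illustration of how the charset declaration above can be read back out of a page or an email header, here is a minimal Python sketch. It uses only the standard library; the names MetaCharsetParser, charset_from_content_type, and declared_charset are made up for this example and are not part of any tool mentioned on this page.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    message = Message()
    message["Content-Type"] = value
    return message.get_content_charset()


class MetaCharsetParser(HTMLParser):
    """Remembers the charset of the first Content-Type <meta> tag seen."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() == "content-type":
            self.charset = charset_from_content_type(attributes.get("content", ""))


def declared_charset(html_text):
    """Return the charset declared in the page's <meta> tag, or None."""
    parser = MetaCharsetParser()
    parser.feed(html_text)
    return parser.charset


page = '<head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252"></head>'
print(declared_charset(page))                                    # windows-1252
print(charset_from_content_type('text/plain; charset="utf-8"'))  # utf-8
```

This only reports what the document claims about itself; it does not check that the bytes really are in that character set.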
There are different places and reasons why a person might use different language settings.
Content-Type <meta> tag is entered in the HTML. EG: In MS FrontPage, this is accessed with the Tools menu > Web Settings option > Language tab > Default Page Encoding box.
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252"> into the <head> tag of the new pages. Microsoft should give you the option to have the default for new pages as ISO Latin 1 (iso-8859-1), but they only offer "US/Western European", which is actually windows-1252. At least they give you some Unicode options (Unicode (UTF-16LE), Unicode big endian (UTF-16BE), or UTF-8), which is useful for later versions of Windows, which can actually make pages encoded in Unicode. My personal opinion is that most pages should use utf-8.
Content-Language <meta> tag into the HTML.
Content-Type <meta> tag in the HTML.
Content-Language <meta> tag in the HTML. It will also (very sneakily) enter the GENERATOR and ProgId <meta> tags in the HTML.
<span lang="SpellingLanguage">. EG: In MS FrontPage, this is done by selecting the text > Tools menu > Set Language option.
When a web document reaches a user, the browser sequentially checks the following to decide which character set it will use to translate the bytes received into characters:
Content-Type <meta> tag in the header.
When deciding on a character set, take the following into account:
&#224; or &#xE0;, while the CER is &agrave;.
One would think that defaulting to windows-1252 would be fine because it is a superset of the printable characters of iso-8859-1 (it adds printable characters at code points 128-159 (x80-x9F), which iso-8859-1 reserves for control codes) and you can use NCR or CER for Unicode characters. However, those extra bytes are not portable. EG: Entering the euro sign on your system (€) will enter the byte for 128 (0x80). If the viewer has a different character set than windows-1252, then that byte may be interpreted as something altogether different.
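The euro example can be tried out with a short Python sketch. The helper name reinterpret is made up for this illustration, cp1252 is Python's codec name for windows-1252, and cp1251 (Windows Cyrillic) is just one arbitrary choice of "a different character set".

```python
def reinterpret(text, written_as, read_as):
    """Encode text with one character set and decode the bytes with another."""
    return text.encode(written_as).decode(read_as)


euro_byte = "€".encode("cp1252")
print(euro_byte)  # b'\x80'

for charset in ("cp1252", "iso-8859-1", "cp1251"):
    seen = reinterpret("€", "cp1252", charset)
    print(charset, "U+%04X" % ord(seen))
# cp1252 U+20AC      (the euro sign)
# iso-8859-1 U+0080  (an invisible control code)
# cp1251 U+0402      (the Cyrillic letter Ђ)

try:
    reinterpret("€", "cp1252", "utf-8")
except UnicodeDecodeError as error:
    # A lone 0x80 byte is not valid utf-8 at all.
    print("utf-8:", error.reason)
```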
One would also think that specifying iso-8859-1 as the character set would be fine. However, the future is globalization, and that means using a Unicode character set. A particular problem is that although Unicode is a superset of iso-8859-1 for code points, the character encoding may be different! EG: For the Yen character (¥), both iso-8859-1 and Unicode have the same code point of 165 = 0xA5. However, iso-8859-1 encodes the Yen as a single byte of 165 = 0xA5 = 10100101, while utf-8 encodes it as two bytes, 11000010 10100101 = C2 A5. If a utf-8 reader came across the iso-8859-1 byte for the Yen, it would have trouble parsing it.
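The Yen example can be reproduced in a few lines of Python. This is only a sketch of the byte-level difference described above; the helper name bits is made up for the illustration.

```python
def bits(data):
    """Render bytes as space-separated 8-bit binary strings."""
    return " ".join(format(byte, "08b") for byte in data)


yen = "\u00a5"  # the Yen character, code point 165 = 0xA5

latin1_bytes = yen.encode("iso-8859-1")
utf8_bytes = yen.encode("utf-8")

print(latin1_bytes.hex(), bits(latin1_bytes))  # a5 10100101
print(utf8_bytes.hex(), bits(utf8_bytes))      # c2a5 11000010 10100101

# A utf-8 reader cannot parse the lone iso-8859-1 byte.
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as error:
    print("utf-8 reader:", error.reason)

# An iso-8859-1 reader parses the utf-8 bytes, but as two characters.
print(utf8_bytes.decode("iso-8859-1"))  # Â¥
```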
My opinion is that all pages should use utf-8 as the character set.
Characters are often encoded in plain text documents such as HTML and XML using a Character Reference, especially characters that are difficult to enter via a keyboard or that have a particular meaning in the syntax. Character references are either NCRs (Numeric Character References) or CERs (Character Entity References). CERs use symbolic names so that authors need not remember code points. EG: For the Yen character (¥), the NCR is either &#165; or &#xA5;, while the CER is &yen;.
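For readers who want to generate or check character references, the following Python sketch builds both NCR forms and looks up the CER using the standard library's HTML entity table. The function names numeric_references and entity_reference are made up for this example.

```python
import html
from html.entities import codepoint2name


def numeric_references(character):
    """Return the decimal and hexadecimal NCRs for one character."""
    code_point = ord(character)
    return "&#%d;" % code_point, "&#x%X;" % code_point


def entity_reference(character):
    """Return the CER for one character, or None if HTML defines none."""
    name = codepoint2name.get(ord(character))
    return "&%s;" % name if name else None


print(numeric_references("¥"))  # ('&#165;', '&#xA5;')
print(entity_reference("¥"))    # &yen;

# All three references decode back to the same character.
for reference in ("&#165;", "&#xA5;", "&yen;"):
    print(reference, "->", html.unescape(reference))
```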
HTML has 252 CERs, while XHTML has 253 CERs, because &apos; (') is defined in XML and XHTML but not in HTML. XML has only 5 CERs, the Predefined CERs (PCERs). For XML and XHTML, the 5 PCERs must be used outside of the tags/elements/entities. Here are the 5 PCERs:
" = " = " = " = QUOTATION MARK.& = & = & = & = AMPERSAND. This one is often missed in URLs. EG: It should be http://fake.com?x=1&y=2 instead of http://fake.com?x=1&y=2.' = ' = ' = ' = APOSTROPHE. Possibly a SQL string issue too. 2005/2007, MSIE does not recognize ' but does recognize '. View the next few words in IE and other browsers: Don't with '; Don't with '; Don't with '.< = < = < = < = LESS-THAN SIGN> = > = > = > = GREATER-THAN SIGNSee also CERs in HTML [w3.org/TR/html401/sgml/entities.html], SYMBOL Characters and Glyphs [w3.org/Math/characters/html/symbol.html], and List of XML and HTML character entity references [W].
CERs are case sensitive. EG:
&Sigma; = Σ.
&sigma; = σ.
2007-07-29 01:44:14Z