You have to know the encoding, and you have to know it first, not after mish-mashing things together and guessing and hoping and blaming the spec.
In an ideal world, with HTML spec'd and written for a single-pass parser, yes. In this world, no. Any browser processing HTML, valid or tag soup, classic or XHTML, generally uses several steps to process its input. One of them is to find the encoding. An HTTP "Content-Type" header with a "charset" parameter is one way to find out the encoding, meta tags are a second way, and Byte Order Marks are also used, plus a lot of heuristics.
That works quite well:
- The "charset" information from a "Content-Type" header is a good first guess.
- A BOM is very easy to detect: it is just a very specific byte sequence for each UTF encoding at the start of the document.
- Without a BOM, UTF-16 and UTF-32 can easily be guessed from the very specific mix of 0x00 bytes and non-0x00 bytes.
- Without a BOM, UTF-8 places a lot of restrictions on bytes >= 0x80. If none of those restrictions is violated, it is very likely that the input is UTF-8.
- UTF-8 and most other encodings in use are supersets of ASCII, so bytes 0x00 to 0x7F can be treated as ASCII in a first pass. (0x80 to 0xFF are just some line noise at this point.) HTML element and attribute names are limited to ASCII, as are the "charset" (encoding) names used in HTTP headers and attribute values. You do not need to know the exact encoding at this point. You just have to know how to read the ASCII characters. As written above, this means one of five ways: 8-bit ASCII superset (including UTF-8), UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE.
- Now that the program has a usable guess of the encoding, it can search for meta tags, ignoring almost everything else. From the http-equiv and charset attributes, it can read the encoding actually used and start parsing the entire document.
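The first-pass guess described in the list above can be sketched roughly like this. It is only an illustration, not what any particular browser does: the names `sniff_family` and `looks_like_utf8` are made up for this example, and the zero-byte test assumes the document starts with two ASCII characters such as `<h`.

```python
import codecs

# BOMs, with UTF-32 first: the UTF-32LE BOM (FF FE 00 00) starts with
# the UTF-16LE BOM (FF FE), so the longer one has to be tested first.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_family(data: bytes) -> str:
    """Guess how to read the ASCII characters of `data` (hypothetical helper)."""
    # Step 1: a BOM settles the question immediately.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name

    # Step 2: no BOM, so look at the pattern of 0x00 bytes in the first
    # four bytes. Assumption: the document starts with ASCII text.
    head = data[:4]
    if len(head) == 4:
        if head[:3] == b"\x00\x00\x00" and head[3] != 0:
            return "utf-32-be"
        if head[1:] == b"\x00\x00\x00" and head[0] != 0:
            return "utf-32-le"
        if head[0] == 0 and head[1] != 0 and head[2] == 0 and head[3] != 0:
            return "utf-16-be"
        if head[0] != 0 and head[1] == 0 and head[2] != 0 and head[3] == 0:
            return "utf-16-le"

    # Step 3: anything else is treated as an 8-bit ASCII superset.
    return "ascii-superset"


def looks_like_utf8(data: bytes) -> bool:
    """True if no UTF-8 restriction is violated (pure ASCII also passes)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Note that `looks_like_utf8` expects complete input. A buffer cut off in the middle of a multi-byte sequence would be rejected, so a real implementation would have to tolerate a truncated tail.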
The order used may differ from browser to browser. The current HTML standard gives a BOM the highest priority, then the "charset" from the HTTP header, and only then a readable meta tag.
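A rough sketch of the meta tag search and of combining the different sources follows. Everything named here (`ascii_view`, `meta_charset`, `choose_encoding`, the 1024-byte limit, the simple regex) is an assumption made for illustration. A real prescan follows the HTML standard's much more careful algorithm. Because the priority differs between implementations, the order is a parameter; the default used here is BOM, then HTTP header, then meta tag, as in the HTML standard.

```python
import re

# Very loose pattern: finds charset=... inside a <meta ...> tag, covering
# both <meta charset="..."> and the http-equiv/content form.
META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def ascii_view(data: bytes, family: str) -> bytes:
    """Reduce the input to single-byte ASCII so the markup can be searched."""
    if family.startswith(("utf-16", "utf-32")):
        text = data.decode(family, errors="replace")
        return text.encode("ascii", errors="replace")
    # 8-bit ASCII superset: bytes 0x00-0x7F are already readable as ASCII.
    return data


def meta_charset(data: bytes, family: str = "ascii-superset", limit: int = 1024):
    """Return the charset named in a meta tag near the start, or None."""
    match = META_CHARSET.search(ascii_view(data[:limit], family))
    return match.group(1).decode("ascii").lower() if match else None


def choose_encoding(bom, header, meta, order=("bom", "header", "meta")):
    """Pick the first available source in the given priority order."""
    found = {"bom": bom, "header": header, "meta": meta}
    for source in order:
        if found[source]:
            return found[source]
    return None
```

For example, `choose_encoding(None, "iso-8859-1", "utf-8")` picks the header value with the default order, while passing `order=("bom", "meta", "header")` lets the meta tag win.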
Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)