Header, start_html and encodings

Nik has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, can one of you please explain to me, as simply as you can, the difference between
print header( -charset=>'utf8' );
and
print start_html( -charset=>'utf8' );

What exactly do they do? I believe the HTTP header tells the client browser that all of the page's data is about to be transmitted to it as UTF-8, right? So why do we have to specify the same thing again in the HTML header as well? Isn't once enough? What exactly does print start_html( -charset=>'utf8' ) do that header doesn't? Please tell me the differences.

Also, I am having some difficulty understanding encodings. I would be grateful if you could put that in simple words too.
As I understand it so far, data is data, and encodings are different ways of storing/viewing that data. Is this correct? But why are so many encodings in use?

Replies are listed 'Best First'.
Re: Header, start_html and encodings by bpphillips (Friar) on May 24, 2006 at 17:42 UTC
Presuming you're talking about the `header` and `start_html` methods that are part of CGI...

`header` specifies the character encoding in the HTTP Content-Type header (the part of an HTTP server response that the user doesn't see when doing a view-source): `Content-type: text/html; charset=UTF-8`

`start_html` specifies the character encoding as part of a <meta> tag within the <head> of the HTML document: `<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">`

Assuming a browser is standards compliant, both methods of specifying the content encoding would accomplish the same thing.

--Brian
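To make the two places concrete, here is a minimal sketch in plain Perl that builds both declarations by hand. It deliberately does not call CGI.pm: the helper names `http_header_for` and `meta_tag_for` are made up for illustration, and the exact text that CGI.pm emits may differ between versions.

```perl
use strict;
use warnings;

# Hypothetical helper: the HTTP response header, sent before the document
# itself. The blank line (\r\n\r\n) ends the header section.
sub http_header_for {
    my ($charset) = @_;
    return "Content-Type: text/html; charset=$charset\r\n\r\n";
}

# Hypothetical helper: the equivalent declaration placed inside the
# document's <head>, which the browser only sees once it reads the HTML.
sub meta_tag_for {
    my ($charset) = @_;
    return qq{<meta http-equiv="Content-Type" content="text/html; charset=$charset">};
}

print http_header_for('UTF-8');
print "<html><head>", meta_tag_for('UTF-8'), "<title>Demo</title></head>\n";
print "<body><p>Hello</p></body></html>\n";
```

The first `print` is the part that view-source never shows; everything after it is the document, with the same charset repeated inside it.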
Re^2: Header, start_html and encodings by dorward (Curate) on May 25, 2006 at 08:26 UTC
The spec is a bit fuzzy when it comes to HTTP-EQUIV. While it does say:

"The META element may be used to specify the default information for a document"

and

"The following example specifies the character encoding for a document as being ISO-8859-5" `<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-5">`

It also says:

"http-equiv = name CI This attribute may be used in place of the name attribute. HTTP servers use this attribute to gather information for HTTP response message headers."

Although the only server I'm aware of which actually does this is Russian Apache (you have to have a way of specifying HTTP headers for all media types, so why bother having an HTML-specific one?). I can't find anywhere in the spec that says that user agents that are not servers should pay any attention to HTTP-EQUIV whatsoever.

There is also the issue that it is rather difficult to read a document if the character encoding is unknown. If the only way to find out the character encoding is to read the document, then you have a problem. Real HTTP headers are the way to go, and I'm not aware of any user agent that has a problem with them.
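The chicken-and-egg problem described above can be sketched as follows. This is only an illustration of the idea, not how any particular browser works: `detect_charset` is a hypothetical helper, and the 1024-byte scan window is an arbitrary assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (charset, source).
sub detect_charset {
    my ($content_type_header, $body_bytes) = @_;

    # 1. The real HTTP header can be read before a single byte of the
    #    body has to be decoded, so it is checked first.
    if (defined $content_type_header
        && $content_type_header =~ /charset\s*=\s*"?([\w.:-]+)"?/i) {
        return (lc $1, 'http-header');
    }

    # 2. Fallback: scan the raw start of the body for a <meta> declaration.
    #    This only works if the encoding is ASCII-compatible enough for
    #    the tag itself to be readable, which is exactly the problem.
    my $prefix = substr($body_bytes // '', 0, 1024);
    if ($prefix =~ /<meta[^>]+charset\s*=\s*["']?([\w.:-]+)/i) {
        return (lc $1, 'meta-tag');
    }

    return (undef, 'unknown');
}

my $html = '<html><head><META http-equiv="Content-Type" '
         . 'content="text/html; charset=ISO-8859-5"></head></html>';

my ($cs1, $src1) = detect_charset('text/html; charset=UTF-8', $html);
print "$cs1 (from $src1)\n";

my ($cs2, $src2) = detect_charset('text/html', $html);
print "$cs2 (from $src2)\n";
```

With a charset in the header the body is never inspected; without one, the code has to guess that the first bytes are readable as ASCII before it knows the encoding.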
Re: Header, start_html and encodings by jhourcle (Prior) on May 24, 2006 at 17:28 UTC
Because there are many badly written web browsers out there, and although there are specifications that explain how they're generally supposed to behave, a significant number of them don't behave correctly, or have abnormal behavior on edge cases. Basically, some browsers look in the HTTP header, others look in the HTML, and no one has the time to test every permutation of their generated pages with even a fraction of the browsers out there. (Warning: the evolt link is down right now, or blocked from my work... I'm just putting it in with the chance that it comes back up.)

As for why there are so many encodings: it's because of the old days of 8-bit computing, when you only had 256 different characters that could be stored, and so you couldn't have a character set with both all Latin and all Cyrillic characters (and eastern languages? not a chance). Although there are multiple UTF standards, it's typically best to move towards one of them. If you're using primarily Latin languages, UTF-8 will typically take up less space than UTF-16 or UTF-32.
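A small sketch with the core Encode module may help with the "data is data" part of the question: the same character becomes different bytes depending on the encoding, and the same byte means different characters depending on how it is decoded. The sample characters and the `hex_bytes` helper are chosen here purely for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $ya = "\x{042F}";   # CYRILLIC CAPITAL LETTER YA
my $e  = "\x{00E9}";   # LATIN SMALL LETTER E WITH ACUTE

# One character, different bytes per encoding.
printf "YA      in ISO-8859-5: %s\n", hex_bytes(encode('ISO-8859-5', $ya)); # CF
printf "YA      in UTF-8:      %s\n", hex_bytes(encode('UTF-8',      $ya)); # D0 AF
printf "e-acute in ISO-8859-1: %s\n", hex_bytes(encode('ISO-8859-1', $e));  # E9
printf "e-acute in UTF-8:      %s\n", hex_bytes(encode('UTF-8',      $e));  # C3 A9

# Plain ASCII text is one byte per character in UTF-8, two in UTF-16.
printf "'a'     in UTF-8:      %s\n", hex_bytes(encode('UTF-8',    'a'));   # 61
printf "'a'     in UTF-16BE:   %s\n", hex_bytes(encode('UTF-16BE', 'a'));   # 00 61

# One byte, different characters: 0xCF is YA in ISO-8859-5,
# but decodes to U+00CF (I with diaeresis) as ISO-8859-1.
printf "0xCF as ISO-8859-1:    U+%04X\n", ord(decode('ISO-8859-1', "\xCF"));
printf "0xCF as ISO-8859-5:    U+%04X\n", ord(decode('ISO-8859-5', "\xCF"));
```

The last two lines are why the browser needs to be told the charset at all: the byte 0xCF on its own does not say which of the two letters was meant.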