in reply to How to determine HTML encoding
is_utf8 says nothing about the encoding of a string; it only reports how Perl stores the string internally. It returns false for the output of encode no matter which encoding you asked for:
$ perl -le'
   use Encode qw( encode is_utf8 );
   my $x = "\xC9ric";
   print is_utf8(encode("iso-8859-1", $x))?1:0;
'
0

$ perl -le'
   use Encode qw( encode is_utf8 );
   my $x = "\xC9ric";
   print is_utf8(encode("UTF-8", $x))?1:0;
'
0
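To make the "internal storage" point concrete, here is a minimal sketch that takes the same string and forces it into each of Perl's two internal formats with the built-in utf8::downgrade and utf8::upgrade. The variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw( is_utf8 );

# Two copies of the same four-character string.
my $downgraded = "\xC9ric";
my $upgraded   = "\xC9ric";

utf8::downgrade($downgraded);  # stored internally as one byte per character
utf8::upgrade($upgraded);      # stored internally as UTF-8

# The strings are still equal, character for character...
my $same = $downgraded eq $upgraded ? 1 : 0;

# ...but the internal-storage flag differs.
my $flag_down = is_utf8($downgraded) ? 1 : 0;
my $flag_up   = is_utf8($upgraded)   ? 1 : 0;

print "equal: $same, is_utf8 downgraded: $flag_down, upgraded: $flag_up\n";
```

Both variables hold the same text, so the flag cannot tell you whether a string is "encoded" or in which charset.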
You determine the charset of an HTML page by looking at the charset parameter of the Content-Type header of the HTTP response.
$ perl -le'
   use LWP::UserAgent;
   print join " ",
      LWP::UserAgent->new
         ->get("http://www.google.ca/")->content_type;
'
text/html charset=ISO-8859-1

$ perl -le'
   use LWP::UserAgent;
   print join " ",
      LWP::UserAgent->new
         ->get("http://www.microsoft.com/")->content_type;
'
text/html charset=utf-8
Since this doesn't work when the HTML is stored in a file (there is no HTTP response to look at), the http-equiv meta element was created to carry the same information inside the document.
$ perl -le'
   use LWP::UserAgent;
   my $html = LWP::UserAgent->new
      ->get("http://www.google.ca/")->decoded_content;
   print for $html =~ /(<meta[^>]*>)/ig;
'
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
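For HTML that is already sitting in a file, a rough sketch of pulling the charset out of that meta element might look like this. meta_charset is a hypothetical helper written for this example, not part of any module, and a regex is only an approximation of real HTML parsing; it also assumes the markup itself is in an ASCII-compatible encoding.

```perl
use strict;
use warnings;

# Hypothetical helper: return the charset named in a <meta> element, or undef.
sub meta_charset {
    my ($html) = @_;
    for my $tag ($html =~ /(<meta\b[^>]*>)/ig) {
        # Take the first charset=... found inside any <meta> tag.
        return $1 if $tag =~ /\bcharset\s*=\s*["']?([\w.:-]+)/i;
    }
    return undef;
}

my $html = '<html><head>'
         . '<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">'
         . '</head></html>';

my $charset = meta_charset($html);
print defined $charset ? $charset : 'unknown', "\n";
```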
That said, you don't need to do ANY of this. Just use HTTP::Response's ->decoded_content method and it will decode the content for you. Then, if you need it encoded, just use encode unconditionally. (Although you may need to adjust the charset in the http-equiv meta element so it matches the new encoding...)
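A minimal sketch of that decode-then-encode flow follows. The response is built by hand so the example runs without network access; in practice it would come from LWP::UserAgent's get. The body text and the choice of UTF-8 as the output encoding are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw( encode );
use HTTP::Response;

# Build a response by hand; the body is ISO-8859-1 bytes, as the header says.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=ISO-8859-1' ],
    "<p>\xC9ric</p>",
);

# Decode according to the charset the response declares.
my $html = $response->decoded_content;

# Re-encode unconditionally into whatever encoding you want to emit.
my $utf8_bytes = encode('UTF-8', $html);

printf "%d characters, %d bytes as UTF-8\n", length($html), length($utf8_bytes);
```

Nothing here inspects is_utf8: the decoded string is treated as text, and encode is applied regardless of how Perl happens to store it.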