is_utf8 does not say anything about the encoding of a string, just how it's stored internally.
$ perl -le'
   use Encode qw( encode is_utf8 );
   my $x = "\xC9ric";
   print is_utf8(encode("iso-8859-1", $x))?1:0;
'
0

$ perl -le'
   use Encode qw( encode is_utf8 );
   my $x = "\xC9ric";
   print is_utf8(encode("UTF-8", $x))?1:0;
'
0
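To illustrate the "stored internally" part, here is a minimal sketch using only core functions. It assumes nothing beyond Encode and the built-in utf8::upgrade; the variable names are just for illustration. The two strings compare equal and have the same length, yet is_utf8 reports different things for them.

```perl
use strict;
use warnings;
use Encode qw( is_utf8 );

my $x = "\xC9ric";     # 4 characters, the first one is U+00C9
my $y = $x;
utf8::upgrade($y);     # changes the internal storage format only

# Same string as far as Perl code is concerned...
print "equal:  ", ( $x eq $y ? 1 : 0 ), "\n";
print "length: ", length($x), " ", length($y), "\n";

# ...but is_utf8 differs, because it reports on storage, not on content.
print "is_utf8(x): ", ( is_utf8($x) ? 1 : 0 ), "\n";
print "is_utf8(y): ", ( is_utf8($y) ? 1 : 0 ), "\n";
```

So the flag can't tell you whether a string has been decoded or what it was encoded with.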
You determine the charset of an HTML page by looking at the Content-Type header of the HTTP response.
$ perl -le'
   use LWP::UserAgent;
   print join " ", LWP::UserAgent->new
      ->get("http://www.google.ca/")->content_type;
'
text/html charset=ISO-8859-1

$ perl -le'
   use LWP::UserAgent;
   print join " ", LWP::UserAgent->new
      ->get("http://www.microsoft.com/")->content_type;
'
text/html charset=utf-8
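If you want to poke at this without hitting the network, a rough sketch is to build the headers by hand. The header value below is made up for the example, and the regex for pulling out the charset is my own simplification, not something the library does for you here.

```perl
use strict;
use warnings;
use HTTP::Headers;

# Hypothetical header value, standing in for a real HTTP response.
my $h = HTTP::Headers->new(
   'Content-Type' => 'text/html; charset=ISO-8859-1',
);

# In list context, content_type returns the media type
# followed by the parameter string.
my ($type, $params) = $h->content_type;

# Simplified extraction of the charset parameter (assumes no quoting).
my ($charset) = ( $params // '' ) =~ /charset=([^;\s]+)/i;

print "type:    $type\n";
print "charset: ", ( $charset // '(none)' ), "\n";
```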
Since there is no HTTP header to look at when the HTML is stored in a file, the <meta> tag with an http-equiv attribute was created to carry the same information inside the document.
$ perl -le'
   use LWP::UserAgent;
   my $html = LWP::UserAgent->new
      ->get("http://www.google.ca/")->decoded_content;
   print for $html =~ /(<meta[^>]*>)/ig;
'
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
That said, you don't need to do ANY of this. Just use HTTP::Response's ->decoded_content method and it will decode the content for you. Then, if you need it encoded, just use encode unconditionally. (Although you may need to adjust the http-equiv meta tag to match the new encoding...)
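A minimal sketch of that decode-then-encode flow, assuming a hand-built HTTP::Response in place of a real fetch (the body and header are made up for the example):

```perl
use strict;
use warnings;
use Encode qw( encode );
use HTTP::Response;

# Hypothetical response: ISO-8859-1 bytes, as declared in the header.
my $response = HTTP::Response->new(
   200, 'OK',
   [ 'Content-Type' => 'text/html; charset=ISO-8859-1' ],
   "\xC9ric",
);

# decoded_content picks the charset from the header and returns characters.
my $text = $response->decoded_content;

# Encode unconditionally to whatever the output needs; no is_utf8 check.
my $utf8_bytes = encode("UTF-8", $text);

print "characters: ", length($text), "\n";
print "UTF-8 bytes (hex): ", unpack("H*", $utf8_bytes), "\n";
```

With a real request, $response would simply be what LWP::UserAgent's ->get returned; the rest stays the same.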
In reply to Re: How to determine HTML encoding
by ikegami
in thread How to determine HTML encoding
by slugger415