slugger415 has asked for the wisdom of the Perl Monks concerning the following question:

Hi blessed monks,

I have successfully used the Encode module to change the charset of HTML pages from iso-8859-1 to utf8, thusly:

 if(!(is_utf8($html))){
  from_to($html, "iso-8859-1", "utf8"); 
 }

My question is, how do I determine the charset of an HTML page? If it's not iso-8859-1, I want to do something more like:

 my($charset);
 $charset = what_is_my_charset($html); # ok, I made that up
 if(!(is_utf8($html))){
  from_to($html, $charset, "utf8"); 
 }

Any thoughts on how I'd do this? I've been looking at Encode but haven't found what I'm seeking.

thanks,
Scott

Replies are listed 'Best First'.
Re: How to determine HTML encoding
by ikegami (Patriarch) on Jun 30, 2010 at 00:31 UTC

    is_utf8 does not say anything about the encoding of a string, just how it's stored internally.

    $ perl -le'
       use Encode qw( encode is_utf8 );
       my $x = "\xC9ric";
       print is_utf8(encode("iso-8859-1", $x))?1:0;
    '
    0

    $ perl -le'
       use Encode qw( encode is_utf8 );
       my $x = "\xC9ric";
       print is_utf8(encode("UTF-8", $x))?1:0;
    '
    0

    You determine the charset of an HTML page by looking at the Content-Type header of the HTTP response.

    $ perl -le'
       use LWP::UserAgent;
       print join " ", LWP::UserAgent->new
          ->get("http://www.google.ca/")->content_type;
    '
    text/html charset=ISO-8859-1

    $ perl -le'
       use LWP::UserAgent;
       print join " ", LWP::UserAgent->new
          ->get("http://www.microsoft.com/")->content_type;
    '
    text/html charset=utf-8

    Since this doesn't work when the HTML is stored in a file, the <meta http-equiv="content-type"> tag was created to carry the same information inside the document itself.

    $ perl -le'
       use LWP::UserAgent;
       my $html = LWP::UserAgent->new
          ->get("http://www.google.ca/")->decoded_content;
       print for $html =~ /(<meta[^>]*>)/ig;
    '
    <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
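
    If all you have is a saved file, a minimal sketch of the made-up what_is_my_charset from the question could look like the following. The helper names sniff_meta_charset and decode_html_bytes are hypothetical (they are not part of Encode), and the regex is only a rough heuristic: it assumes the page uses an ASCII-compatible encoding and that the meta tag sits in the first 4096 bytes.

```perl
use strict;
use warnings;
use Encode qw( decode find_encoding );

# Hypothetical helper: scan the start of the raw bytes for a charset
# declaration in a <meta> tag. Returns the declared name, or nothing.
sub sniff_meta_charset {
    my ($bytes) = @_;
    my $head = substr($bytes, 0, 4096);
    for my $tag ($head =~ /(<meta\b[^>]*>)/ig) {
        # Covers both content="text/html; charset=..." and charset="..."
        return $1 if $tag =~ /\bcharset\s*=\s*["']?\s*([\w.:-]+)/i;
    }
    return;
}

# Hypothetical helper: decode raw bytes into a character string, using
# the declared charset when Encode recognises it, else the fallback.
sub decode_html_bytes {
    my ($bytes, $fallback) = @_;
    my $name = sniff_meta_charset($bytes);
    $name = $fallback unless defined $name && find_encoding($name);
    return decode($name, $bytes);
}

# Made-up sample input: Latin-1 bytes with an http-equiv declaration.
my $bytes = qq{<html><head><meta http-equiv="content-type" }
          . qq{content="text/html; charset=ISO-8859-1"></head>}
          . qq{<body>\xC9ric</body></html>};

my $html = decode_html_bytes($bytes, "iso-8859-1");
print scalar(sniff_meta_charset($bytes)), "\n";   # prints ISO-8859-1
```

    A real HTML parser would be more robust than a regex; this is only meant to show the idea of reading the declaration and handing the name to decode.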

    That said, you don't need to do ANY of this. You just use HTTP::Response's ->decoded_content method and it will decode the content for you. Then, if you need it encoded, just use encode unconditionally. (Although you may need to adjust the http-equiv meta tag so that the charset it declares matches the new encoding...)

      Ok I'm going to be dumb here, but I'm not getting HTTP::Response to work.

      my $r = HTTP::Response->new($url);
      print "r: ", $r->decoded_content, "\n";

      ... returns nothing. What am I doing wrong?
      Scott

        You're creating an HTTP::Response out of thin air. It does not run off to the server and fetch $url. Maybe you want to actually fetch content through LWP::UserAgent or WWW::Mechanize and then use the ->decoded_content method on whatever response they give you?
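
        A rough sketch of that flow, assuming HTTP::Response (from HTTP::Message) is installed. The helper name response_as_utf8 is made up for illustration. So that it runs without a network connection, the response below is built by hand with a status, a Content-Type header and a body, standing in for the object a real fetch would return.

```perl
use strict;
use warnings;
use Encode qw( encode );
use HTTP::Response;

# Hypothetical helper: take an HTTP::Response and return its body
# re-encoded as UTF-8 bytes.
sub response_as_utf8 {
    my ($response) = @_;
    die "Request failed: " . $response->status_line . "\n"
        unless $response->is_success;

    # decoded_content turns the raw body into a character string,
    # using the charset declared in the Content-Type header.
    my $chars = $response->decoded_content;
    die "Could not decode content\n" unless defined $chars;

    # Encode unconditionally; no is_utf8 check needed.
    return encode("UTF-8", $chars);
}

# Hand-built stand-in for a fetched response (assumption: a Latin-1 page).
my $response = HTTP::Response->new(
    200, "OK",
    [ "Content-Type" => "text/html; charset=ISO-8859-1" ],
    "<p>\xC9ric</p>",
);

my $utf8 = response_as_utf8($response);
print length($utf8), " bytes\n";
```

        With a real fetch you would pass in the result of LWP::UserAgent->new->get($url) instead of the hand-built object; the helper itself stays the same.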