in reply to How to determine HTML encoding

is_utf8 does not say anything about the encoding of a string, just how it's stored internally.

$ perl -le'
   use Encode qw( encode is_utf8 );
   my $x = "\xC9ric";
   print is_utf8(encode("iso-8859-1", $x))?1:0;
'
0

$ perl -le'
   use Encode qw( encode is_utf8 );
   my $x = "\xC9ric";
   print is_utf8(encode("UTF-8", $x))?1:0;
'
0

You determine the charset of an HTML page by looking at the Content-Type header of the HTTP response.

$ perl -le'
   use LWP::UserAgent;
   print join " ",
      LWP::UserAgent->new
         ->get("http://www.google.ca/")->content_type;
'
text/html charset=ISO-8859-1

$ perl -le'
   use LWP::UserAgent;
   print join " ",
      LWP::UserAgent->new
         ->get("http://www.microsoft.com/")->content_type;
'
text/html charset=utf-8

Since there is no HTTP header to look at when the HTML is stored in a file, the <meta http-equiv="content-type"> tag was created.

$ perl -le'
   use LWP::UserAgent;
   my $html = LWP::UserAgent->new
      ->get("http://www.google.ca/")->decoded_content;
   print for $html =~ /(<meta[^>]*>)/ig;
'
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

That said, you don't need to do ANY of this. You just use HTTP::Response's ->decoded_content method and it will decode the content for you. Then, if you need it encoded, just use encode unconditionally. (Although you may then need to adjust the http-equiv meta tag so it matches the new encoding...)
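
For what it's worth, here is a minimal sketch of that decode-then-encode flow. It builds an HTTP::Response by hand (the header and body are made up for illustration) so that it runs without a network connection. With a real request you would call ->decoded_content on whatever LWP::UserAgent's ->get returns.

```perl
use strict;
use warnings;
use Encode qw( encode );
use HTTP::Response;

# Hand-built response standing in for what LWP::UserAgent->get() would return.
# The Content-Type header and the body are assumptions made for this example.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=ISO-8859-1' ],
    "<p>\xC9ric</p>",    # body bytes, encoded as ISO-8859-1
);

# decoded_content picks up the charset from Content-Type and returns characters.
my $html = $response->decoded_content;

# Encode unconditionally to whatever encoding the output needs.
my $utf8_bytes = encode('UTF-8', $html);

printf "%d chars, %d bytes\n", length($html), length($utf8_bytes);
```

The point is that nothing here inspects is_utf8 or guesses the encoding: the response says what it is, and the output side says what it wants.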

Re^2: How to determine HTML encoding
by slugger415 (Monk) on Jun 30, 2010 at 18:23 UTC

    Ok I'm going to be dumb here, but I'm not getting HTTP::Response to work.

    my $r = HTTP::Response->new($url);
    print "r: ", $r->decoded_content, "\n";

    ... returns nothing. What am I doing wrong?
    Scott

      You're creating an HTTP::Response out of thin air. It does not run off to the server and fetch $url. Maybe you want to actually fetch the content through LWP::UserAgent or WWW::Mechanize and then use the ->decoded_content method on whatever response they give you?
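
      Roughly like this, as a sketch only. The helper name fetch_decoded is made up for illustration; the calls it makes (->get, ->is_success, ->status_line, ->decoded_content) are the usual LWP::UserAgent and HTTP::Response methods.

      ```perl
      use strict;
      use warnings;
      use LWP::UserAgent;

      # Hypothetical helper: fetch $url with the given user agent
      # and return the decoded body as a character string.
      sub fetch_decoded {
          my ($ua, $url) = @_;

          # The user agent performs the request and hands back an HTTP::Response.
          my $response = $ua->get($url);
          die "GET $url failed: ", $response->status_line, "\n"
              unless $response->is_success;

          # decoded_content is called on that response object, not on the class.
          return $response->decoded_content;
      }
      ```

      You would call it as fetch_decoded(LWP::UserAgent->new, $url) and print the result.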

        Yes I see, thank you. Scott

        Ok, so I've got some HTML content with my own subroutine:

        my $html = get_content($url);
        print $html;
        

        Result is I get the contents of $html printed, fine.

        Then I want to decode it:

        print HTTP::Response->decoded_content($html);
        

        Result:

        Odd number of elements in hash assignment at C:/Perl/lib/HTTP/Message.pm line 289.
        Use of uninitialized value in print at myscript.pl line 269.
        

        What's going on? I read the doc on decoded_content and messages and it says something about options I don't understand.

        $mess->decoded_content( %options )
        

        Scott