vitoco has asked for the wisdom of the Perl Monks concerning the following question:

I'm using WWW::Mechanize (v1.54) to log into a site and extract some data from it. On some pages, I get the "Wide character in print" warning when saving the page with $mech->save_content().

I noticed that those pages declare the charset twice, and the two declarations do not match:
Content-type: text/html;charset=ISO-8859-1
in the HTTP response header, and
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
inside the html page.

Doing a trace, $mech->content_type() returns "text/html" and $mech->response()->encoding() returns "iso-8859-1".

It seems that WWW::Mechanize's save_content() method (and probably others) should check not only the content type when deciding on binmode, but also the encoding. Or maybe HTTP::Response is not doing its work?

Or am I missing something?
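To make the mismatch concrete, here is a small self-contained sketch (charset_mismatch is a hypothetical helper, not part of WWW::Mechanize) that extracts the charset from the HTTP Content-Type header and from the <meta> tag and compares them:

```perl
use strict;
use warnings;

# Hypothetical helper: compare the charset claimed by the HTTP
# Content-Type header with the one claimed by the <meta> tag.
# Returns an arrayref [header_charset, meta_charset] on mismatch,
# undef otherwise.
sub charset_mismatch {
    my ($content_type_header, $html) = @_;
    my ($header_charset) = $content_type_header =~ /charset=["']?([\w-]+)/i;
    my ($meta_charset)   = $html =~ /<meta[^>]+charset=["']?([\w-]+)/i;
    return unless defined $header_charset && defined $meta_charset;
    return lc($header_charset) ne lc($meta_charset)
        ? [ lc $header_charset, lc $meta_charset ]
        : undef;
}

my $mismatch = charset_mismatch(
    'text/html;charset=ISO-8859-1',
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>',
);
print "header says $mismatch->[0], page says $mismatch->[1]\n" if $mismatch;
```

With the two declarations from the pages above, this reports that the header says iso-8859-1 while the page says utf-8.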

Replies are listed 'Best First'.
Re: Explicit charset confuses WWW::Mechanize and/or HTTP::Response
by ikegami (Patriarch) on May 13, 2009 at 22:33 UTC
    save_content never encodes the previously decoded content. Bug!
      The bug is it should binmode unconditionally.

        No. You can't write decoded text to a file, binmoded or not.

        Or do you mean :encoding needs to be specified unconditionally? That's wrong too, since Mechanize doesn't always decode the content.

Re: Explicit charset confuses WWW::Mechanize and/or HTTP::Response
by vitoco (Hermit) on May 15, 2009 at 15:52 UTC

    Finally, I determined that the HTML page is being sent by the webserver using iso-8859-1 (latin-1), not utf-8.

    The problem is that the <meta> tag is lying about the encoding of the page (it says utf-8). I think HTTP::Response decodes the content based on that tag, so WWW::Mechanize receives corrupted data, which it then saves with a wide-character warning.

    As I cannot change anything from the remote server, how can I handle this? Is there a way to stop the automagic decoding done by modules and then process that data myself?

    Thanks...
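      One way out (a sketch, under the assumption that the server really sends Latin-1): take the raw, undecoded octets and decode them yourself with Encode. $mech->response->content returns the undecoded bytes, and HTTP::Message's decoded_content() also accepts a charset override, e.g. decoded_content(charset => 'ISO-8859-1'), if you prefer to stay within the response object:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Suppose these are the raw octets obtained without any automatic
# decoding (e.g. from $mech->response->content): Latin-1 bytes,
# where 0xF1 is the byte for the character "ñ".
my $raw = "Espa\xF1a";

# Decode with the charset we know is correct, ignoring the lying
# <meta> tag:
my $text = decode('ISO-8859-1', $raw);
# $text now holds the six characters of "España".
```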

      I am curious about something. Have you tried more than one browser? If your situation is as it seems to me, you may see the proper character handling in Firefox, but IE will fail. (I've been there, done that.) Firefox will read the HTTP headers and respect them; IE does not. IE reads the charset from the content headers in the initial output sent to the browser. Therefore, you need to do this in your code, before printing anything else to the browser:

      print "Content-type: text/html; charset=utf-8\n\n"; #print CGI::header();

      In other words, the charset must be declared as utf-8 from the very first output sent to the browser onward.

      Blessings,

      Polyglot

        Polyglot: in this case, IE displays the page OK, because the proper encoding is what the HTTP header says, just as you predicted (the liar is the HTML header). I have not tried Firefox yet, but Opera also renders those pages well. I'm not sure whether those browsers do some automatic detection beyond what the HTTP and/or HTML headers declare. Unfortunately, I can't touch the remote server's code; I just have to live with it...