vitoco has asked for the wisdom of the Perl Monks concerning the following question:

I'm using WWW::Mechanize (v1.54) to log into a site and extract some data from it. On some pages, I get the "Wide character in print" warning when saving the page with $mech->save_content().

I noticed that those pages declare the charset twice, and the two declarations do not match:
Content-type: text/html;charset=ISO-8859-1
in the HTTP response header, and
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
inside the html page.

Doing a trace, $mech->content_type() returns "text/html" and $mech->response()->encoding() returns "iso-8859-1".

It seems that WWW::Mechanize's save_content() method (and probably others) should check not only the content type when deciding on binmode, but also the encoding. Or maybe HTTP::Response is not doing its work?

Or am I missing something?
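To make the mismatch concrete, here is a small self-contained sketch (charset_mismatch is a hypothetical helper, not part of WWW::Mechanize) that extracts the charset from the HTTP Content-Type header and from the <meta> tag and compares them:

```perl
use strict;
use warnings;

# Hypothetical helper: compare the charset claimed by the HTTP
# Content-Type header with the one claimed by the <meta> tag.
# Returns an arrayref [header_charset, meta_charset] on mismatch,
# undef otherwise.
sub charset_mismatch {
    my ($content_type_header, $html) = @_;
    my ($header_charset) = $content_type_header =~ /charset=["']?([\w-]+)/i;
    my ($meta_charset)   = $html =~ /<meta[^>]+charset=["']?([\w-]+)/i;
    return unless defined $header_charset && defined $meta_charset;
    return lc($header_charset) ne lc($meta_charset)
        ? [ lc $header_charset, lc $meta_charset ]
        : undef;
}

my $mismatch = charset_mismatch(
    'text/html;charset=ISO-8859-1',
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>',
);
print "header says $mismatch->[0], page says $mismatch->[1]\n" if $mismatch;
```

With the two declarations from the pages above, this reports that the header says iso-8859-1 while the page says utf-8.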

Replies are listed 'Best First'.
Re: Explicit charset confuses WWW::Mechanize and/or HTTP::Response
by ikegami (Patriarch) on May 13, 2009 at 22:33 UTC
    save_content never encodes the previously decoded content. Bug!
      The bug is it should binmode unconditionally.

        No. You can't write decoded text to a file, binmoded or not.

        Or do you mean :encoding needs to be specified unconditionally? That's wrong too, since Mechanize doesn't always decode the content.

Re: Explicit charset confuses WWW::Mechanize and/or HTTP::Response
by vitoco (Hermit) on May 15, 2009 at 15:52 UTC

    Finally, I determined that the HTML page is being sent by the webserver using iso-8859-1 (latin-1), not utf-8.

    The problem is that the <meta> tag is lying about the encoding of the page (it says utf-8). I think HTTP::Response decodes the content based on that tag, so WWW::Mechanize receives corrupted data, which it then saves with a wide-character warning.

    As I cannot change anything from the remote server, how can I handle this? Is there a way to stop the automagic decoding done by modules and then process that data myself?

    Thanks...
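      One way out (a sketch, under the assumption that the server really sends Latin-1): take the raw, undecoded octets and decode them yourself with Encode. $mech->response->content returns the undecoded bytes, and HTTP::Message's decoded_content() also accepts a charset override, e.g. decoded_content(charset => 'ISO-8859-1'), if you prefer to stay within the response object:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Suppose these are the raw octets obtained without any automatic
# decoding (e.g. from $mech->response->content): Latin-1 bytes,
# where 0xF1 is the byte for the character "ñ".
my $raw = "Espa\xF1a";

# Decode with the charset we know is correct, ignoring the lying
# <meta> tag:
my $text = decode('ISO-8859-1', $raw);
# $text now holds the six characters of "España".
```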

      I am curious about something. Have you tried more than one browser? If your situation is as it seems to me, you may see the proper character handling in Firefox, but IE will fail. (I've been there, done that.) Firefox will read the HTTP headers and respect them; IE does not. IE reads the charset from the content headers in the initial output sent to the browser. Therefore, you need to do this in your code, before printing anything else to the browser:

      print "Content-type: text/html; charset=utf-8\n\n"; #print CGI::header();

      In other words, the charset must be declared as utf-8 from the very first output sent to the browser onward.

      Blessings,

      Polyglot

        Polyglot: in this case, IE displays the page OK, because the proper encoding is what the HTTP header says, just as you predicted (the liar is the HTML header). I have not tried Firefox yet, but Opera also renders those pages well. I'm not sure whether those browsers do some automatic detection beyond what the HTTP and/or HTML headers declare. Unfortunately, I can't touch the remote server's code; I just have to live with it...