WWW::Mechanize encoding again...

GaijinPunch has asked for the wisdom of the Perl Monks concerning the following question:

I was recently (and graciously) informed here that later versions of WWW::Mechanize will automatically encode pages (in utf8, it would seem). This generally isn't a problem, as I can convert the text to my encoding of choice (EUC in this case) with Jcode. However, I've hit a block.

```perl
## The below is fine and dandy
my $page = $mech->content();
$page = Jcode->new( $page )->euc();

## This is not
my %fields;
# read fields from form.
$fields{"comment"} = "EUC encoded string";
$mech->submit_form( form_number => 1, fields => \%fields );
```

$mech is still holding the content of the previously loaded page in utf8. Even though I pass it an EUC encoded string, it sends it as utf8 (mojibake in this case). Any way around this?


Re: WWW::Mechanize encoding again... by Anonymous Monk on Dec 07, 2008 at 09:50 UTC
I think the encoding is determined by the page/form... create your own HTTP request?
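A rough sketch of what "create your own HTTP request" could look like, assuming HTTP::Request::Common (part of the libwww-perl family that WWW::Mechanize builds on) is available. The URL is a hypothetical placeholder, and the sample string is just three kanji chosen for illustration; only the `comment` field name comes from the question. The idea is to encode the value to EUC-JP bytes yourself and let `POST` percent-escape those bytes as they are.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTTP::Request::Common qw(POST);

# Hypothetical form target; substitute the real action URL of the form.
my $url = 'http://example.com/post';

# The comment as a Perl character string (U+65E5 U+672C U+8A9E).
my $comment = "\x{65E5}\x{672C}\x{8A9E}";

# Encode to EUC-JP bytes ourselves, so nothing downstream has to guess.
my $euc_bytes = encode('euc-jp', $comment);

# POST builds an application/x-www-form-urlencoded body and
# percent-escapes the bytes it is given.
my $req = POST $url, [ comment => $euc_bytes ];

print $req->method, ' ', $req->uri, "\n";
print $req->content, "\n";

# To send it through the existing session (needs network access):
# my $response = $mech->request($req);
```

Whether the server then interprets the body as EUC-JP still depends on the server, so this only takes the client-side guesswork out of the picture.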
Re^2: WWW::Mechanize encoding again... by GaijinPunch (Pilgrim) on Dec 07, 2008 at 10:19 UTC
The pages I'm reading are all euc-jp encoded. Mechanize forces them into utf8.
Re: WWW::Mechanize encoding again... by graff (Chancellor) on Dec 08, 2008 at 04:26 UTC
I'm a little out of my depth here, but have you tried: `$mech->add_header( Encoding => 'OFFICIAL_ENCODING_LABEL_FOR_YOUR_DATA' );`

That is, provide the string that would normally be used to identify the encoding of your "EUC encoded string" in the http and/or html header.

Apart from that, it appears (from looking at the source for WWW::Mechanize) that there's nothing in the module to alter the encoding of data being passed through it to a given web server. So if you're having trouble getting the strings to go through correctly, you may need to exercise more control in your own code to manage encoding issues.

For example, there are some WM methods that allow you to pass an optional file handle for output; if you don't supply one, it prints to STDOUT in the "normal" (default) manner. This should usually do the right thing, but if/when it doesn't, maybe you need to supply a file handle with the correct encoding layer already assigned via the 3-arg open call.

BTW, is it really the case that referring to your data as "EUC" is sufficient to identify what it really is? I've seen references to 'euc-jp' and 'euc-kr' and 'euc-cn', but not the simple "euc" by itself -- which makes me wonder if maybe "euc" by itself is ambiguous... (Maybe your reference to "Jcode" would peg it to Japanese, but I wouldn't know.)
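To make the last two points concrete, here is a minimal sketch using only the core Encode module. It asks Encode which of those labels it resolves, then opens a handle with an explicit `:encoding(euc-jp)` layer via 3-arg open. The in-memory buffer and the three sample kanji are stand-ins chosen for illustration, not anything from the original problem.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Ask Encode which labels it resolves; unrecognized labels return undef.
for my $label (qw(euc euc-jp euc-kr euc-cn)) {
    my $enc = find_encoding($label);
    printf "%-6s => %s\n", $label, $enc ? $enc->name : '(not recognized)';
}

# Three-argument open with an explicit encoding layer. An in-memory
# buffer stands in for a real file so the resulting bytes can be inspected.
my $buffer = '';
open my $fh, '>:encoding(euc-jp)', \$buffer
    or die "Cannot open in-memory handle: $!";
print {$fh} "\x{65E5}\x{672C}\x{8A9E}";   # three characters
close $fh or die "Cannot close handle: $!";

printf "bytes written: %d\n", length $buffer;
```

The same open mode works with a real filename in place of `\$buffer`, and the resulting handle can be passed to any method that accepts an output file handle.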