in reply to www:mechanize mangles unicode

Let's start with a server-side script:

#!/usr/bin/perl

use strict;
use warnings;

use CGI;
use Encode         qw( decode );
use HTML::Entities qw( encode_entities );

my $cgi = CGI->new();
my $val = $cgi->param('key');

use Devel::Peek;
Dump($val);

$val = decode('iso-8859-15', $val) if defined($val);

print $cgi->header('text/html; charset=iso-8859-15');
binmode STDOUT, ':encoding(iso-8859-15)';

my $val_initializer = (
    defined($val)
        ? sprintf(' value="%s"', encode_entities($val, '<>&"'))
        : ''
);

print(<<"__EOI__");
<title>Test</title>
<form method="POST">
<input type="text" name="key"$val_initializer>
<input type="submit">
</form>
__EOI__

Let's make sure it works:

$ perl -e'print <<"__EOI__";
POST /zzz.cgi HTTP/1.0
Host: www.example.com
Content-Length: 11

key=Ch\xE2teau
__EOI__
' | nc www.example.com 80 | od -c
00000   H   T   T   P   /   1   .   1       2   0   0       O   K  \r
00020  \n   D   a   t   e   :       W   e   d   ,       2   8       A
00040   p   r       2   0   1   0       2   2   :   1   0   :   1   4
00060       G   M   T  \r  \n   S   e   r   v   e   r   :       A   p
00100   a   c   h   e  \r  \n   V   a   r   y   :       A   c   c   e
00120   p   t   -   E   n   c   o   d   i   n   g  \r  \n   C   o   n
00140   t   e   n   t   -   L   e   n   g   t   h   :       1   1   8
00160  \r  \n   C   o   n   n   e   c   t   i   o   n   :       c   l
00200   o   s   e  \r  \n   C   o   n   t   e   n   t   -   T   y   p
00220   e   :       t   e   x   t   /   h   t   m   l   ;       c   h
00240   a   r   s   e   t   =   i   s   o   -   8   8   5   9   -   1
00260   5  \r  \n  \r  \n   <   t   i   t   l   e   >   T   e   s   t
00300   <   /   t   i   t   l   e   >  \n   <   f   o   r   m       m
00320   e   t   h   o   d   =   "   P   O   S   T   "   >  \n   <   i
00340   n   p   u   t       t   y   p   e   =   "   t   e   x   t   "
00360       n   a   m   e   =   "   k   e   y   "       v   a   l   u
00400   e   =   "   C   h 342   t   e   a   u   "   >  \n   <   i   n
00420   p   u   t       t   y   p   e   =   "   s   u   b   m   i   t
00440   "   >  \n   <   /   f   o   r   m   >  \n
00453

Yup. Now let's test WWW::Mechanize.

use strict;
use warnings;
use open ':std', ':locale';

use charnames ':full';

use Encode         qw( encode );
use WWW::Mechanize qw( );

# Avoiding script encoding issues.
my $val = "Ch\N{LATIN SMALL LETTER A WITH CIRCUMFLEX}teau";

my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get('http://www.server.com/zzz.cgi');
$mech->field('key', $val);
$mech->submit();

#print($mech->value('key'), "\n");
use Devel::Peek qw( Dump );
Dump($mech->value('key'));

Hum, I get:

SV = PV(0x1167c20) at 0x11d05c0
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK,UTF8)
  PV = 0x11572c0 "Ch\303\203\302\242teau"\0 [UTF8 "Ch\x{c3}\x{a2}teau"]
  CUR = 10
  LEN = 16

But I expect:

SV = PV(0x1167c20) at 0x11d05c0
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK,UTF8)
  PV = 0x115bcf0 "Ch\303\242teau"\0 [UTF8 "Ch\x{e2}teau"]
  CUR = 8
  LEN = 16

or the equivalent

SV = PV(0x1167c20) at 0x11d05c0
  REFCNT = 1
  FLAGS = (TEMP,POK,pPOK)
  PV = 0x115bcf0 "Ch\342teau"\0
  CUR = 7
  LEN = 16

Some debugging shows the server side is receiving the following:

"Ch\303\242teau"

That's the UTF-8 encoding of the value, so the problem is in getting the right data to the server. OK, fine, maybe WWW::Mechanize naively sends the string's internal representation. If so, the solution would be to encode the inputs yourself as follows:

#$mech->field('key', $val);
$mech->field('key', encode('iso-8859-15', $val));
$mech->submit();

But even with the change, the client side script is still sending the following to the server:

"Ch\303\242teau"

That's the UTF-8 encoding of the result of encode('iso-8859-15', $val). Does WWW::Mechanize assume the server expects UTF-8 rather than the page's encoding?
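
The double encoding can be reproduced offline with Encode alone. The following is a minimal sketch, not what WWW::Mechanize literally does internally: it assumes the client ends up applying a UTF-8 encode to whatever string it is handed, and that the server decodes the received bytes as iso-8859-15, as the script above does. The variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# The value as a string of characters (U+00E2 is a-circumflex).
my $val = "Ch\x{E2}teau";

# Assumption: the client applies a UTF-8 encode to the field value.
my $sent_plain = encode('UTF-8', $val);

# Pre-encoding to iso-8859-15 yields the byte E2, which is then
# treated as the character U+00E2 and UTF-8 encoded all over again.
my $sent_pre = encode('UTF-8', encode('iso-8859-15', $val));

# The server decodes what it received as iso-8859-15.
my $seen = decode('iso-8859-15', $sent_plain);

printf "plain:       %s\n", unpack('H*', $sent_plain);
printf "pre-encoded: %s\n", unpack('H*', $sent_pre);
printf "server sees: %s\n",
    join ' ', map { sprintf 'U+%04X', ord($_) } split //, $seen;
```

The first two lines print the same bytes, which is why pre-encoding made no difference. The decoded string has eight characters, with U+00C3 and U+00A2 where the single U+00E2 should be, just like the first dump above.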

It's all I have time for right now.

Replies are listed 'Best First'.
Re^2: www:mechanize mangles unicode
by ikegami (Patriarch) on Apr 28, 2010 at 22:53 UTC

    Found the bug.

    For starters, everything works fine if the server sends

    <form method="POST" accept-charset="iso-8859-15">

    HTML::Form (used by WWW::Mechanize) processes that attribute and generates the correct form data. The bug is that WWW::Mechanize doesn't inform HTML::Form of the page's charset, leaving HTML::Form with no idea what to do when accept-charset is missing. (It defaults to using UTF-8.)

    Some may not consider this a bug since the spec simply recommends the behaviour, but it's what other browsers do.

      Wow. Thanks.

      Now I'm searching for how to tell HTML::Form which character set to use from the client side.

        HTML::Form->parse(..., charset => $encoding)
        but you can do it after the fact with
        $form->accept_charset($encoding)
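
        To make the two options concrete, here is a minimal sketch that runs without a server. The HTML snippet, the base URL and the variable names are made up for illustration; only parse(), its charset option, accept_charset(), value() and make_request() come from HTML::Form.

        ```perl
        use strict;
        use warnings;
        use HTML::Form;

        # Hypothetical page: a form with no accept-charset attribute.
        my $html = '<form method="POST" action="/zzz.cgi">'
                 . '<input type="text" name="key"></form>';
        my $base = 'http://www.example.com/';

        # Character string to submit (U+00E2 is a-circumflex).
        my $val = "Ch\x{E2}teau";

        # 1. No charset given: HTML::Form falls back to UTF-8.
        my ($default) = HTML::Form->parse($html, base => $base);
        $default->value(key => $val);

        # 2. Tell the parser which charset the page was served in.
        my ($parsed) = HTML::Form->parse(
            $html, base => $base, charset => 'iso-8859-15',
        );
        $parsed->value(key => $val);

        # 3. Or set the charset on an already parsed form.
        my ($patched) = HTML::Form->parse($html, base => $base);
        $patched->accept_charset('iso-8859-15');
        $patched->value(key => $val);

        # Show the form data each variant would send.
        print $_->make_request->content, "\n" for $default, $parsed, $patched;
        ```

        The first line printed should be key=Ch%C3%A2teau and the other two key=Ch%E2teau. With WWW::Mechanize, the form object to patch would be the one returned by form_number() or current_form().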
Re^2: www:mechanize mangles unicode
by Desmond Walsh (Initiate) on Nov 27, 2015 at 00:10 UTC

    I have encountered the same problem of form parameters being converted to UTF-8 when using WWW::Mechanize to access a web site.

    My application automates access to a dictionary web site in order to build a grammar database. The initial web page has one HTML form. My application inserts a starting head word in that form. The response web page has a second form that contains the next head word in the dictionary. By making that form the current form, a loop is established to retrieve N consecutive words from the dictionary.

    The trouble is that Perl converts accented characters (the 5 vowels á í é ó ú, upper and lower case) into UTF-8, and the dictionary misreads all words with these characters.

    Looking at the HTML, I find the settings:

    'default_charset' => 'windows-1252',
    'enctype' => 'application/x-www-form-urlencoded',
    'accept_charset' => 'UNKNOWN',

    I presume that I need to change this to 'accept_charset' => 'UTF-8'. However, I do not see any method in Mechanize that will allow me to do this. Is it possible to use HTML::Form with Mechanize to do this?

    I would appreciate any help from members of the forum. Thank you.

    use strict;
    use warnings;
    use WWW::Mechanize;
    use Encode qw(encode decode);       # Tried encode, decode
                                        # without success

    # Create a new browser
    my $browser = WWW::Mechanize->new(autocheck => 1 );

    # Tell it to get the main page
    $browser->get("http://193.1.97.44/focloir/");

    # Okay, fill in the form with the first word to look up
    $browser->form_number(1);           # Select first as active form
    $browser->field("WORD", "acht");    # Next word in dict is achtú

    # Get a consecutive sequence of words, one word per web request
    for (my $i=1; $i<=2; $i++) {
        $browser->dump_forms();         # i=1 WORD parameter is acht
                                        #     Hex: [61 63 68 74]
                                        # i=2 WORD parameter is achtú
                                        #     Hex: [61 63 68 74 FA]
        $browser->click();              # Make the Web request
        print $browser->content;        # i=1 Word found
                                        # i=2 Message: could not find -
                                        #     acht&#195;&#186;
                                        #     Hex: [61 63 68 74 c3 ba]
                                        #     which is achtú in UTF-8
        sleep (1);                      # Just in case we get into
                                        # trouble with the web server
        #
        # Pick the second form. It should have the next head word
        # already filled in
        # NOTE: application code does not access any parameters on
        #       this form
        $browser->form_number(2);       # Select second form as
                                        # active form
    }
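
    The byte values quoted in the comments can be checked offline with Encode. This is only a sketch: it assumes the site reads incoming bytes as windows-1252, as the default_charset setting suggests, which is a guess about the server rather than something confirmed here.

    ```perl
    use strict;
    use warnings;
    use Encode qw( encode decode );

    # The head word as a character string (U+00FA is u-acute).
    my $word = "acht\x{FA}";

    # What the site presumably expects: one byte per character.
    my $single = encode('cp1252', $word);

    # What is sent when the form data is encoded as UTF-8.
    my $utf8 = encode('UTF-8', $word);

    # Assumption: the server reads the UTF-8 bytes as windows-1252.
    my $misread = decode('cp1252', $utf8);

    printf "windows-1252: %s\n", unpack('H*', $single);
    printf "UTF-8:        %s\n", unpack('H*', $utf8);

    # Print non-ASCII characters as numeric entities, like the error page.
    printf "misread as:   %s\n",
        join '', map { ord($_) < 128 ? $_ : '&#' . ord($_) . ';' }
                 split //, $misread;
    ```

    Under that assumption the last line reproduces the acht&#195;&#186; seen in the error message.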

      I have resolved the issue. The answer was already supplied in an earlier posting from ikegami. I did not understand how to call the HTML::Form method accept_charset.

      Below is the modified code that now handles accented input correctly.

      I really appreciate all the wisdom lurking in this forum.

      use strict;
      use warnings;
      use WWW::Mechanize;

      # Create a new browser
      my $browser = WWW::Mechanize->new(autocheck => 1 );

      # Tell it to get the main page
      $browser->get("http://193.1.97.44/focloir/");

      # Okay, fill in the form with the first word to look up
      $browser->form_number(1);           # Select first as active form
      #
      # This is the patch to specify input character set
      $browser->form_number(2)->accept_charset ("iso-8859-15");
      $browser->field("WORD", "acht");    # Next word in dict is achtú

      # Get a consecutive sequence of words, one word per web request
      for (my $i=1; $i<=2; $i++) {
          $browser->dump_forms();         # i=1 WORD parameter is acht
                                          #     Hex: [61 63 68 74]
                                          # i=2 WORD parameter is achtú
                                          #     Hex: [61 63 68 74 FA]
          $browser->click();              # Make the Web request
          print $browser->content;        # i=1 Word found
                                          # i=2 Message: could not find -
                                          #     acht&#195;&#186;
                                          #     Hex: [61 63 68 74 c3 ba]
                                          #     which is achtú in UTF-8
          sleep (1);                      # Just in case we get into
                                          # trouble with the web server
          #
          # Pick the second form. It should have the next head word
          # already filled in
          # NOTE: application code does not access any parameters on
          #       this form
          $browser->form_number(2);       # Select second form as
                                          # active form
          #
          # This is the patch to specify input character set
          $browser->form_number(2)->accept_charset ("iso-8859-15");
      }
Re^2: www:mechanize mangles unicode
by red0hat (Initiate) on Apr 28, 2010 at 22:57 UTC

    That is far better written than I could produce.

    Eventually, I got to much the same place. Perhaps there is something fishy happening in HTML::Form?

    btw

    perl 5.10.1

    WWW::Mechanize 1.62