in reply to Re: www:mechanize mangles unicode
in thread www:mechanize mangles unicode

I have encountered the same problem of form parameters being converted to UTF-8 when using WWW::Mechanize to access a web site.

My application automates access to a dictionary web site in order to build a grammar database. The initial web page has one HTML form, and my application inserts a starting head word in that form. The response page has a second form that contains the next head word in the dictionary. By making that second form the current form, I set up a loop that retrieves N consecutive words from the dictionary.

The trouble is that Perl converts accented characters (the five vowels á, é, í, ó, ú, in upper and lower case) to UTF-8, and the dictionary then misreads every word that contains them.
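To make the mismatch concrete, here is a minimal sketch using only the core Encode module. It encodes the head word "achtú" once as ISO-8859-1 and once as UTF-8, then prints the raw bytes and a percent-escaped form. The helpers hex_bytes and percent_escape are hypothetical names written for this illustration; the escaping is hand-rolled to approximate application/x-www-form-urlencoded and is not what WWW::Mechanize calls internally.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Render a byte string as space-separated lowercase hex, e.g. "61 63".
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

# Percent-escape every byte outside the unreserved set
# (hand-rolled approximation, for illustration only).
sub percent_escape {
    my ($bytes) = @_;
    $bytes =~ s/([^A-Za-z0-9\-_.~])/sprintf '%%%02X', ord $1/ge;
    return $bytes;
}

my $word = "acht\x{fa}";    # "achtú" as a Perl character string

for my $charset ('iso-8859-1', 'UTF-8') {
    my $bytes = encode($charset, $word);
    printf "%-10s  %-17s  WORD=%s\n",
        $charset, hex_bytes($bytes), percent_escape($bytes);
}
```

The ISO-8859-1 line ends in the single byte fa, while the UTF-8 line ends in c3 ba. That is the same pair of byte sequences as the hex dumps in the code comments of this post.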

Looking at the HTML, I find these settings:

'default_charset' => 'windows-1252', 'enctype' => 'application/x-www-form-urlencoded', 'accept_charset' => 'UNKNOWN',

I presume that I need to change this to 'accept_charset' => 'UTF-8'. However, I do not see any method in Mechanize that will allow me to do this. Is it possible to use HTML::Form with Mechanize to do this?

I would appreciate any help from members of the forum. Thank you.

use strict;
use warnings;
use WWW::Mechanize;
use Encode qw(encode decode);    # Tried encode, decode without success

# Create a new browser
my $browser = WWW::Mechanize->new( autocheck => 1 );

# Tell it to get the main page
$browser->get("http://193.1.97.44/focloir/");

# Okay, fill in the form with the first word to look up
$browser->form_number(1);           # Select first as active form
$browser->field("WORD", "acht");    # Next word in dict is achtú

# Get a consecutive sequence of words, one word per web request
for (my $i=1; $i<=2; $i++) {
    $browser->dump_forms();
    # i=1 WORD parameter is acht   Hex: [61 63 68 74]
    # i=2 WORD parameter is achtú  Hex: [61 63 68 74 FA]

    $browser->click();              # Make the Web request
    print $browser->content;
    # i=1 Word found
    # i=2 Message: could not find - acht&#195;&#186;
    #     Hex: [61 63 68 74 c3 ba], which is achtú in UTF-8

    sleep(1);    # Just in case we get into trouble with the web server

    # Pick the second form. It should have the next head word
    # already filled in.
    # NOTE: application code does not access any parameters on this form
    $browser->form_number(2);       # Select second form as active form
}

Re^3: www:mechanize mangles unicode
by Desmond Walsh (Initiate) on Nov 27, 2015 at 21:20 UTC

    I have resolved the issue. The answer had already been supplied in an earlier posting from ikegami, but I did not understand how to call the HTML::Form method accept_charset.

    Below is the modified code, which now handles accented input correctly.

    I really appreciate all the wisdom lurking in this forum.

    use strict;
    use warnings;
    use WWW::Mechanize;

    # Create a new browser
    my $browser = WWW::Mechanize->new( autocheck => 1 );

    # Tell it to get the main page
    $browser->get("http://193.1.97.44/focloir/");

    # Okay, fill in the form with the first word to look up
    $browser->form_number(1);           # Select first as active form

    # This is the patch to specify input character set
    $browser->form_number(2)->accept_charset("iso-8859-15");

    $browser->field("WORD", "acht");    # Next word in dict is achtú

    # Get a consecutive sequence of words, one word per web request
    for (my $i=1; $i<=2; $i++) {
        $browser->dump_forms();
        # i=1 WORD parameter is acht
        #     Hex: [61 63 68 74]
        # i=2 WORD parameter is achtú
        #     Hex: [61 63 68 74 FA]

        $browser->click();              # Make the Web request
        print $browser->content;
        # i=1 Word found
        # i=2 Message: could not find - acht&#195;&#186;
        #     Hex: [61 63 68 74 c3 ba], which is achtú in UTF-8

        sleep(1);    # Just in case we get into trouble with the web server

        # Pick the second form. It should have the next head word
        # already filled in.
        # NOTE: application code does not access any parameters on this form
        $browser->form_number(2);       # Select second form as active form

        # This is the patch to specify input character set
        $browser->form_number(2)->accept_charset("iso-8859-15");
    }
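For anyone who wants to see the effect of accept_charset without hitting the dictionary server, the following is a rough offline sketch. It assumes a reasonably recent HTML::Form (the 6.x series) in which make_request encodes field values according to accept_charset. The HTML snippet, the action path, the base URL and the helper body_for are all made up for illustration and are not taken from the real site.

```perl
use strict;
use warnings;
use HTML::Form;

# Hypothetical stand-in for the dictionary's lookup form.
my $html = <<'HTML';
<form method="post" action="/lookup">
  <input type="text" name="WORD" value="">
  <input type="submit" value="Look up">
</form>
HTML

# Build the POST body for a word, optionally overriding accept_charset.
# The base URL is a placeholder; no request is ever sent.
sub body_for {
    my ($word, $charset) = @_;
    my ($form) = HTML::Form->parse(
        $html,
        base    => 'http://example.invalid/',
        charset => 'UTF-8',    # document charset used when accept_charset is UNKNOWN
    );
    $form->accept_charset($charset) if defined $charset;
    $form->value( WORD => $word );
    return $form->make_request->content;
}

my $word = "acht\x{fa}";    # "achtú" as a Perl character string

print "default:     ", body_for($word), "\n";
print "iso-8859-15: ", body_for($word, 'iso-8859-15'), "\n";
```

If the assumption about HTML::Form holds, the two bodies should differ only in how ú is escaped: two escaped bytes under the UTF-8 default, one under iso-8859-15. With WWW::Mechanize the same setter is reachable because form_number returns the selected HTML::Form object, which is what the patch above relies on.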