I have encountered the same problem of form parameters being converted to UTF-8 when using WWW::Mechanize to access a web site.

My application automates access to a dictionary web site in order to build a grammar database. The initial web page has one HTML form, and my application inserts a starting head word in that form. The response page has a second form that contains the next head word in the dictionary. By making that second form the current form on each pass, a loop is established that retrieves N consecutive words from the dictionary.

The trouble is that Perl converts accented characters (the five vowels á é í ó ú, in both upper and lower case) into UTF-8, and the dictionary misreads every word that contains them.
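To make the mismatch concrete, here is a minimal sketch using only the core Encode module. It prints the bytes of the head word achtú under the two encodings involved; the helper name `hex_bytes` is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

# "acht" followed by U+00FA (u with acute accent), written as an escape
# so that the encoding of this source file does not matter.
my $word = "acht\x{FA}";

my $cp1252 = hex_bytes( encode( 'windows-1252', $word ) );
my $utf8   = hex_bytes( encode( 'UTF-8',        $word ) );

print "windows-1252: $cp1252\n";
print "UTF-8:        $utf8\n";
```

The single byte fa in the first line against the pair c3 ba in the second is the same difference that shows up in the hex dumps in the script comments.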

Looking at the HTML, I find these settings:

'default_charset' => 'windows-1252', 'enctype' => 'application/x-www-form-urlencoded', 'accept_charset' => 'UNKNOWN',

I presume that I need to change this to 'accept_charset' => 'UTF-8'. However, I do not see any method in Mechanize that will allow me to do this. Is it possible to use HTML::Form with Mechanize to do this?
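For reference, HTML::Form documents an `accept_charset` accessor, and WWW::Mechanize hands back the active form as an HTML::Form object from `current_form()`. The following is only an offline sketch of how that setting appears to affect the encoded request. The one-field form and the host example.invalid are placeholders, not taken from the dictionary site, and the behaviour shown assumes a reasonably recent HTML::Form.

```perl
use strict;
use warnings;
use HTML::Form;

# Hypothetical stand-in for the dictionary's lookup form.
my $html = <<'HTML';
<form method="get" action="http://example.invalid/lookup">
  <input type="text" name="WORD">
</form>
HTML

# "charset" names the encoding of the page the form came from.
# HTML::Form falls back to it while accept_charset is "UNKNOWN".
my $form = HTML::Form->parse(
    $html,
    base    => 'http://example.invalid/',
    charset => 'windows-1252',
);

# Set the head word as a character string (U+00FA is u with acute).
$form->value( WORD => "acht\x{FA}" );

# Build the request without sending it, once per setting.
my $default_uri = $form->make_request->uri->as_string;

$form->accept_charset('UTF-8');
my $utf8_uri = $form->make_request->uri->as_string;

print "accept_charset UNKNOWN: $default_uri\n";
print "accept_charset UTF-8:   $utf8_uri\n";
```

With Mechanize, the equivalent call would presumably be `$browser->current_form->accept_charset(...)` after each `form_number()`. Whether that changes what the dictionary server receives has not been tested here.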

I would appreciate any help from members of the forum. Thank you.

use strict;
use warnings;
use WWW::Mechanize;
use Encode qw(encode decode);    # Tried encode, decode without success

# Create a new browser
my $browser = WWW::Mechanize->new( autocheck => 1 );

# Tell it to get the main page
$browser->get("http://193.1.97.44/focloir/");

# Okay, fill in the form with the first word to look up
$browser->form_number(1);           # Select first as active form
$browser->field( "WORD", "acht" );  # Next word in dict is achtú

# Get a consecutive sequence of words, one word per web request
for my $i ( 1 .. 2 ) {    # "my" is required under use strict
    $browser->dump_forms();
    # i=1 WORD parameter is acht   Hex: [61 63 68 74]
    # i=2 WORD parameter is achtú  Hex: [61 63 68 74 FA]

    $browser->click();    # Make the Web request
    print $browser->content;
    # i=1 Word found
    # i=2 Message: could not find - acht&#195;&#186;
    #     Hex: [61 63 68 74 c3 ba], which is achtú in UTF-8

    sleep(1);    # Just in case we get into trouble with the web server

    # Pick the second form. It should have the next head word
    # already filled in.
    # NOTE: application code does not access any parameters on this form
    $browser->form_number(2);    # Select second form as active form
}

In reply to Re^2: www:mechanize mangles unicode by Desmond Walsh
in thread www:mechanize mangles unicode by red0hat
