
I have encountered the same problem of form parameters being converted to UTF-8 when using WWW::Mechanize to access a web site.

My application automates access to a dictionary web site in order to build a grammar database. The initial web page has one HTML form. My application inserts a starting head word in that form. The response web page has a second form that contains the next head word in the dictionary. By making that second form the current form, a loop is established that retrieves N consecutive words from the dictionary.

The trouble is that Perl converts accented characters (the five vowels á é í ó ú, upper and lower case) into UTF-8, and the dictionary misreads all words containing these characters.
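To make the byte difference concrete, here is a minimal sketch using only the core Encode module. The helper name hex_bytes is my own, and cp1252 is Encode's name for windows-1252. It only illustrates the two encodings of achtú; it does not show what Mechanize does internally.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the bytes of a string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# "achtú" as a Perl character string (U+00FA is u with acute accent).
my $word = "acht\x{FA}";

my $cp1252 = hex_bytes(encode('cp1252', $word));   # one byte for ú
my $utf8   = hex_bytes(encode('UTF-8',  $word));   # two bytes for ú

print "windows-1252: [$cp1252]\n";
print "UTF-8:        [$utf8]\n";
```

The first line is the form the dictionary seems to expect; the second matches the bytes echoed back in the "could not find" message.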

Looking at the HTML, I find these settings:

    'default_charset' => 'windows-1252',
    'enctype'         => 'application/x-www-form-urlencoded',
    'accept_charset'  => 'UNKNOWN',

I presume that I need to change this to 'accept_charset' => 'UTF-8'. However, I do not see any method in Mechanize that will allow me to do this. Is it possible to use HTML::Form with Mechanize to do this?
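For illustration, here is a minimal sketch of how the accept_charset setting appears to influence the encoded request, assuming HTML::Form's documented accept_charset accessor and the charset option of parse. The inline form, the /lookup action and the example.invalid base URL are hypothetical stand-ins for the dictionary page, and query_for is a helper name I made up. I have not confirmed that this resolves the problem on the real site.

```perl
use strict;
use warnings;
use HTML::Form;

# Hypothetical stand-in for the dictionary page: one GET form with a WORD field.
my $html = '<form method="get" action="/lookup">'
         . '<input type="text" name="WORD"></form>';

# Build the query string that would be sent for $word under a given
# accept_charset (undef leaves the parsed default, 'UNKNOWN', in place).
sub query_for {
    my ($accept_charset, $word) = @_;
    my $form = HTML::Form->parse(
        $html,
        base    => 'http://example.invalid/',   # placeholder base URL
        charset => 'windows-1252',              # becomes the form's default charset
    );
    $form->accept_charset($accept_charset) if defined $accept_charset;
    $form->value(WORD => $word);
    return $form->make_request->uri->query;
}

my $word = "acht\x{FA}";                        # "achtú" as a character string

my $default_query = query_for(undef,   $word);  # falls back to the default charset
my $utf8_query    = query_for('UTF-8', $word);  # forces UTF-8 encoding

print "accept_charset UNKNOWN: $default_query\n";
print "accept_charset UTF-8:   $utf8_query\n";
```

With Mechanize, the same accessor should be reachable through $browser->current_form, which returns the HTML::Form object for the active form.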

I would appreciate any help from members of the forum. Thank you.

    use strict;
    use warnings;
    use WWW::Mechanize;
    use Encode qw(encode decode);      # Tried encode, decode 
                                       # without success
            
#   Create a new browser
    my $browser = WWW::Mechanize->new(autocheck => 1 );
    
#   Tell it to get the main page
    $browser->get("http://193.1.97.44/focloir/"); 
    
#   Okay, fill in the form with the first word to look up
    $browser->form_number(1);         # Select first as active form
    $browser->field("WORD", "acht");  # Next word in dict is achtú
    
#   Get a consecutive sequence of words, one word per web request    
    for my $i (1 .. 2)                # $i declared so the loop compiles under strict
    {
        $browser->dump_forms();       # i=1 WORD parameter is acht
                                      #     Hex: [61 63 68 74]
                                      # i=2 WORD parameter is achtú
                                      #     Hex: [61 63 68 74 FA]
        $browser->click();            # Make the Web request
        print $browser->content;      # i=1 Word found    
                                      # i=2 Message: could not find - 
                                      #     acht&#195;&#186; 
                                      #     Hex: [61 63 68 74 c3 ba]
                                      #     which is achtú in UTF-8   
        sleep (1);                    # Just in case we get into 
                                      # trouble with the web server
#        
#       Pick the second form. It should have the next head word 
#       already filled in
#       NOTE: application code does not access any parameters on 
#             this form
        $browser->form_number(2);     # Select second form as 
                                      # active form
    }

In reply to Re^2: www:mechanize mangles unicode by Desmond Walsh
in thread www:mechanize mangles unicode by red0hat
