I'm trying to parse some HTML I've downloaded from a site I have no control over (so no "just change the site" answers :) and am having trouble parsing the form I'm pasting below.

I am also pasting a test script that has the basic flow of my code. I must be missing something, which is entirely possible considering my newbie-ness to web/HTML/HTTP processing, so be gentle if it turns out I'm doing something really stupid :)

A few notes may be in order. I have changed the HTML to protect the innocent.
In the actual content of the page there are 2 forms, which is why I parse multiple forms in my code. If it works doing it one at a time, I'm all for code that works.

#!/usr/bin/perl -w
use HTTP::Request::Common qw(POST GET);
use HTTP::Cookies;
use HTTP::Request::Form;
use LWP::UserAgent;
use HTML::Form;
use HTML::TreeBuilder;

open( TEST, "<test.html" ) or die "cannot open test.html\n$!";

# Read in the whole file as a single string.
# $/ is the delimiter used; localize it in a small scope.
$text = do { local $/; <TEST> };
print $text;

$tree = HTML::TreeBuilder->new;
$tree->parse( $text );
$tree->eof();

my @requestForms = HTTP::Request::Form->new_many( $tree );
$tree->delete();

# just some code to pause on while I play with the debugger.
foreach $form (@requestForms) {
    print "$form\n";    # hash...awesome!
    # do something
}

The HTML I'm trying to parse:

<form name="form1" method="post" action="/foo/bar.php">
<table width="550" border="0" cellspacing="0" cellpadding="0" align="center">
  <tr>
    <td>
      <div align="center">IMAGE</div>
    </td>
  </tr>
  <tr>
    <td>
      <div align="center">
        <table width="550" border="0" cellspacing="0" cellpadding="0">
          <tr>
            <td width="275">
              <div align="center"><span class="plaintextbold">IMAGE</span></div>
            </td>
            <td>
              <div align="center" class="fightdamage"> TEXT <br>
                <hr width="175">
                <span class="plaintextbold">TEXT: 1 <br>
                </span>
                <div align="center"><span class="plaintextbold">TEXT:</span> 14 </div>
                <div align="center"><span class="plaintextbold">TEXT:</span> 11 </div>
                <div align="center"><span class="plaintextbold">TEXT:</span> 3 </div>
                <div align="center"><span class="plaintextbold">TEXT:</span> 3 </div>
                <div align="center"><span class="plaintextbold">TEXT:</span> 15 </div>
                <div align="center"><span class="plaintextbold">TEXT:</span> 16 </div>
                <div align="center"><span class="plaintextbold">TEXT:</span> 16 </div>
                <div align="center"><span class="plaintextbold">TEXT:</span> 0 </div>
                <div align="center"><span class="plaintextbold">TEXT:</span> 0 </div>
                <div align="center"><span class="plaintextbold">TEXT:</span> 31 / 31 </div>
                <div align="center"><span class="plaintextbold">TEXT:</span> 0 / 0 <br>
                  <hr width="175">
                </div>
              </div>
            </td>
          </tr>
        </table>
      </div>
    </td>
  </tr>
  <tr>
    <td>
      <div align="center"></div>
    </td>
  </tr>
  <tr>
    <td>
      <div align="center">
        <hr width="550">
        <span class="plaintextbold"> <br>
        <select name="bar">
          <option>OPTION1</option>
          <option>OPTION2</option>
          <option>OPTION3</option>
        </select>
        <input type="submit" name="Submit2" value="Go!" class="login">
        <br>
        <br>
        </span></div>
    </td>
  </tr>
  <tr>
    <td>
      <div align="center">IMAGE</div>
    </td>
  </tr>
  <tr>
    <td>
      <div align="center"><br>
      </div>
    </td>
  </tr>
</table>
</form>
</div>
</td>
</tr>

I'm a debugger kind of guy, so here is the output when I examine the @requestForms object:

  DB<3> x @requestForms
0  HTTP::Request::Form=HASH(0x864f6b0)
   'allfields' => ARRAY(0x864fb4c)
      0  'bar'
   'base' => undef
   'buttons' => ARRAY(0x8686edc)
      0  'Submit2'
   'buttontypes' => HASH(0x8668c70)
      'Submit2' => ARRAY(0x8650b78)
         0  'submit'
   'buttonvals' => HASH(0x86828a4)
      'Submit2' => ARRAY(0x864f518)
         0  'Go!'
   'checkboxstate' => HASH(0x8682484)
        empty hash
   'debug' => undef
   'fields' => ARRAY(0x864fb34)
      0  'bar'
   'fieldtypes' => HASH(0x866bee0)
      'bar' => 'select'
   'fieldvals' => HASH(0x86805b4)
        empty hash
   'link' => '/foo/bar.php'
   'method' => 'post'
   'name' => 'form1'
   'selections' => HASH(0x8680518)
      'bar' => ARRAY(0x8650ab8)
         0  undef
         1  undef
         2  undef
   'upload' => 0

You may note the undef items in 'selections'; I'm assuming they are supposed to have OPTIONX in them?
I'm not sure how the default option is specified.
I'm not sure what functions you use to select a selection. (the field function does not seem to do what I want)
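
One avenue that might be worth comparing against: HTML::Form (already loaded by the script, but never used) can parse the forms straight from the HTML text and, as far as I can tell, falls back to the option's text when an <option> has no value attribute. The following is only a minimal sketch under assumptions: the form is a trimmed-down stand-in for the one in this post, and the base URL http://example.com/ is hypothetical, needed only to resolve the relative action. It is not a fix for HTTP::Request::Form itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Form;

# Trimmed-down stand-in for the form in the post (layout tables omitted).
my $html = <<'HTML';
<form name="form1" method="post" action="/foo/bar.php">
  <select name="bar">
    <option>OPTION1</option>
    <option>OPTION2</option>
    <option>OPTION3</option>
  </select>
  <input type="submit" name="Submit2" value="Go!" class="login">
</form>
HTML

# Hypothetical base URL, used only to make the form action absolute.
my $base = 'http://example.com/';

# In list context parse() returns one HTML::Form object per <form> found,
# so a page with two forms yields two objects.
my @forms = HTML::Form->parse( $html, $base );

my ( @choices, $request );
foreach my $form (@forms) {
    # Skip any form that has no input named 'bar'.
    my $select = $form->find_input('bar') or next;

    # List what the select offers and what it currently holds.
    @choices = $select->possible_values;
    print "choices: @choices\n";
    my $current = $form->value('bar');
    print "current: ", ( defined $current ? $current : '(none)' ), "\n";

    # Choose an option, then build the request that pressing
    # the Submit2 button would send (an HTTP::Request object).
    $form->value( 'bar', 'OPTION2' );
    $request = $form->click('Submit2');
    print $request->as_string;
}
```

The resulting HTTP::Request could then be handed to an LWP::UserAgent via its request method. Whether the real page behaves the same way depends on markup that was stripped from the sample, so treat the option names and the default selection as things to verify in the debugger.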
My copy of Perl & LWP should be here any day now.....maybe that will solve everything :)

Thanks!

In reply to Parsing Forms with selections by Helter
