benchtoplabs has asked for the wisdom of the Perl Monks concerning the following question:

Please pardon me if this question is well-hashed elsewhere (no pun intended), but I have been digging on this for a few days and have searched everywhere I could think of, to no significant avail. (Not even sure that meant what I thought :-)) I am using a script to query a website which contains data on clients. I am sending a "post" with %form_data using LWP and storing the response in $res like so:
$res = $ua->post($my_url, \%form_data);
The website I am posting to returns data about my customer, which I intend to use to make decisions about what to "post" next, receiving more data and so on... I have done this in the past with another website, where I was able to get the data back as a cleanly formatted XML list of variables and values:
$response = $xml->XMLin($res->content);
print "CallerLastName is:", $response->{CallerLastName}, "\n";
print "ReturnStatus is:", $response->{ReturnStatus}, "\n";
print "CallerPhoneNumber is:", $response->{CallerPhoneNumber}, "\n";
In this case, I am getting a very convoluted HTML table with numerous sub-tables, and it's a real mess. The data I need is "hidden" within the table, inside named tags:
<html>
<head> <title>Login Results</title> </head>
<body>
<h3>Database Results</h3>
<table>
<tr> <td>Return Status:&#160;<ReturnStatus>Done</ReturnStatus></td> </tr>
</table>
<table>
<tr> <td>StateInfo:&#160;<StateInfo>0014378</StateInfo></td> </tr>
<tr> <td>Multiple Sub Account:&#160;<MultipleSubAccount>N</MultipleSubAccount></td> </tr>
<tr> <td>Subscription Count:&#160;<UserCount>1</UserCount></td> </tr>
<tr> <td>Default Account:&#160;<DefaultAccount>114879</DefaultAccount></td> </tr>
<tr> <td>Caller Phone Number:&#160;<CallerPhoneNumber>8005551212</CallerPhoneNumber></td> </tr>
<tr> <td>Caller House Number:&#160;<CallerHouseNumber> 123</CallerHouseNumber></td> </tr>
<tr> <td>Apartment Num:&#160;<ApartmentNum> </ApartmentNum></td> </tr>
<tr> <td>Caller Salutation:&#160;<CallerSalutation></CallerSalutation></td> </tr>
<tr> <td>Caller First Name:&#160;<CallerFirstName>JOHN</CallerFirstName></td> </tr>
<tr> <td>Caller Last Name:&#160;<CallerLastName>SMITH</CallerLastName></td> </tr>
<tr> <td>Salutation:&#160;<Salutation></Salutation></td> </tr>
<tr> <td>FirstName:&#160;<FirstName>JOHN</FirstName></td> </tr>
<tr> <td>MiddleInitial:&#160;<MiddleInitial></MiddleInitial></td> </tr>
<tr> <td>LastName:&#160;<LastName>SMITH</LastName></td> </tr>
..........SNIP..........

I am trying to pull out data such as their account number or name etc. What I am finding is that if I bring it through XMLin like so: $response = XMLin($res->content); I get an output like this:

$VAR1 = {
  'body' => {
    'table' => [
      { 'tr' => { 'td' => { 'ReturnStatus' => 'Done', 'content' => "ReturnStatus:\x{a0}" } } },
      { 'tr' => [
          { 'td' => { 'StateInfo' => '0014378', 'content' => "StateInfo:\x{a0}" } },
          { 'td' => { 'MultipleSubAccount' => 'N', 'content' => "Multiple Sub Account:\x{a0}" } },
          { 'td' => { 'UserCount' => '1', 'content' => "UserCount:\x{a0}" } },
          { 'td' => { 'content' => "Default Account:\x{a0}", 'DefaultAccount' => '114879' } },
          {
..........SNIP..........

Looking through this mess, it looks like I can probably use it as a 3-dimensional array (but I was hoping for something easier).

Right now I am referencing data like so:

print "content is:", $response->{body}->{table}->{tr}->{td}->{content}, "\n";

But this seems like a real mess, and it looks like I am going to have to write a separate subroutine for every different page I post to, as they are all formatted slightly differently. I was hoping to do it like I have in the past, where I can get it into a single-level XML array.

Hopefully I haven't confused anyone too much, but am I on the right track, or is there an easier way to grab this data? Thanks for any input, and please redirect me if I posted wrong.

Replies are listed 'Best First'.
Re: Parse HTML Code for hidden values
by ikegami (Patriarch) on Aug 06, 2010 at 00:47 UTC
    It's easy to squash what you have.
    my %rec;
    for (@{ $response->{body}{table} }) {
        my ($key) = grep $_ ne 'content', keys(%{$_->{tr}{td}});
        my $val = $_->{tr}{td}{$key};
        $rec{$key} = $val;
    }
    print("$rec{LastName}\n"); # For example
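    To make the key-picking step in that loop concrete, here is a minimal sketch on a single cell. The %td hash is a made-up sample shaped like one 'td' entry in the dump from the question: the label sits under 'content' and the value sits under the tag name, so filtering out 'content' leaves the tag name. Note that the loop reaches this hash through $_->{tr}{td}, which assumes each table has a single tr stored as a hash.

```perl
use strict;
use warnings;

# Hypothetical cell, shaped like one 'td' entry in the dump from the question.
my %td = (
    ReturnStatus => 'Done',
    content      => "Return Status:\x{a0}",
);

# Every key that is not the label text is treated as the data key.
my ($key) = grep { $_ ne 'content' } keys %td;
my $val   = $td{$key};

print "$key = $val\n";
```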
      So, I have to admit... I am a bit more of a "copy-paste" coder... I get the logic, but not always the syntax. :-) I see what you are saying here, and I think it is my answer, but for the life of me I can't get the loop to run. I was getting no data out of this, so I put a debug statement in the "for" loop, and it doesn't appear to be running past the first 2 lines. I can dump $response, and I get valid data..
      Dumper($response->{body}->{table})
      Here is a snippet of my test code:
      print "body-table is:", Dumper($response->{body}->{table}), "\n";
      print "==================================================================\n";
      my %rec;
      for (@{ $response->{body}{table} }) {
          print "Data Record Found - Body Table \n";
          my ($key) = grep $_ ne 'content', keys(%{$_->{tr}{td}});
          my $val = $_->{tr}{td}{$key};
          $rec{$key} = $val;
      }
      print "rec is:", Dumper($rec), "\n";
      And the result is:
      body-table is:$VAR1 = [
        { 'tr' => { 'td' => { 'ReturnStatus' => 'Completed', 'content' => "Return Status:\x{a0}" } } },
        { 'tr' => [
            { 'td' => { 'StateInfo' => '12345', 'content' => "StateInfo:\x{a0}" } },
            { 'td' => { 'MultipleSubAccount' => 'N', 'content' => "Multiple Sub Account:\x{a0}" } },
            { 'td' => { 'Count' => '1', 'content' => "Count:\x{a0}" } },
            { 'td' => { 'content' => "Default Account:\x{a0}", 'DefaultAccount' => '1234567' } },
            { 'td' => { 'content' => "Caller Phone Number:\x{a0}", 'CallerPhoneNumber' => '214-555-1212' } },
            { 'td' => { 'CallerHouseNumber' => ' 123', 'content' => "Caller House Number:\x{a0}" } },
            { 'td' => { 'ApartmentNum' => {}, 'content' => "Apartment Num:\x{a0}" } },
            { 'td' => { 'CallerSalutation' => {}, 'content' => "Caller Salutation:\x{a0}" } },
            { 'td' => { 'CallerFirstName' => 'JOHN', 'content' => "Caller First Name:\x{a0}" } },
            { 'td' => { 'content' => "Caller Last Name:\x{a0}", 'CallerLastName' => 'SMITH' } },
            { 'td' => { 'content' => "Salutation:\x{a0}", 'Salutation' => {} } },
            { 'td' => { 'FirstName' => 'JOHN', 'content' => "FirstName:\x{a0}" } },
            { 'td' => { 'content' => "MiddleInitial:\x{a0}", 'MiddleInitial' => {} } },
            { 'td' => { 'LastName' => 'SMITH', 'content' => "LastName:\x{a0}" } },
            { 'td' => { 'Honorific' => {}, 'content' => "Honorific:\x{a0}" } },
            { 'td' => { 'content' => "FullName:\x{a0}", 'FullName' => 'JOHN SMITH' } },
            { 'td' => { 'OtherName' => {}, 'content' => "OtherName:\x{a0}" } },
            { 'td' => { 'OtherNameUsage' => {}, 'content' => "OtherNameUsage:\x{a0}" } },
            { 'td' => { 'HouseNumber' => '123', 'content' => "House Number:\x{a0}" } },
            { 'td' => { 'content' => "UnitNumber:\x{a0}", 'UnitNumber' => {} } },
            { 'td' => { 'content' => "City:\x{a0}", 'City' => 'ANYTOWN' } },
            { 'td' => { 'content' => "State:\x{a0}", 'State' => 'OH' } },
            { 'td' => { 'Zip' => '12345-1234', 'content' => "Zip:\x{a0}" } },
            { 'td' => { 'AddressLine1' => 'JOHN SMITH', 'content' => "AddressLine1:\x{a0}" } },
            { 'td' => { 'AddressLine2' => '123 W MAIN ST', 'content' => "AddressLine2:\x{a0}" } },
            { 'td' => { 'AddressLine3' => 'ANYTOWN OH 12345-1234', 'content' => "AddressLine3:\x{a0}" } },
            { 'td' => { 'AddressLine4' => {}, 'content' => "AddressLine4:\x{a0}" } },
            { 'td' => { 'AddressLine5' => {}, 'content' => "AddressLine5:\x{a0}" } },
            { 'td' => { 'content' => "AddressLine6:\x{a0}", 'AddressLine6' => {} } },
            { 'td' => { 'AddressLine7' => {}, 'content' => "AddressLine7:\x{a0}" } },
            { 'td' => { 'content' => "AddressLine8:\x{a0}", 'AddressLine8' => {} } },
            { 'td' => { 'a' => { 'href' => 'lastpmt.html?stateInfo=12345', 'content' => 'Last Payment Info' } } },
            { 'td' => { 'a' => { 'href' => 'serverrform1.html?stateInfo=12345', 'content' => 'Service Error' } } },
            { 'td' => { 'a' => { 'href' => 'stopinfoform.html?stateInfo=12345', 'content' => 'Stop/Start' } } },
            { 'td' => { 'a' => { 'href' => 'renewinfoform.html?stateInfo=12345', 'content' => 'Make a Payment' } } },
            { 'td' => { 'a' => { 'href' => 'loginform.html', 'content' => 'Change Login' } } }
          ] }
      ];
      ==================================================================
      Data Record Found - Body Table
      Data Record Found - Body Table
      rec is:$VAR1 = undef;
      Any help is GREATLY appreciated!! Thanks!
        The dump shows that your second table has many tr's, each holding one td, so 'tr' is an array reference there rather than a hash, and the loop above (which treats $_->{tr} as a hash) does not match that shape. You can, for example, add an inner loop over the tr's - but if the structure of your data is not fixed, you can have similar problems every time another level of nesting appears!
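        One way to sidestep the nesting problem is to walk the structure recursively rather than hard-coding each level. The following is only a sketch under stated assumptions: collect_fields is a hypothetical helper (not part of XML::Simple), the sample $response is a trimmed, hand-written copy of the dump above, empty elements that show up as {} are stored as empty strings, and cells holding other nested tags (such as the 'a' links) are skipped rather than descended into.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: walk nested hashes/arrays and copy every
# "tag => value" pair found inside a 'td' hash into %$rec.
sub collect_fields {
    my ($node, $rec) = @_;

    if (ref($node) eq 'ARRAY') {
        collect_fields($_, $rec) for @$node;
    }
    elsif (ref($node) eq 'HASH') {
        for my $key (keys %$node) {
            my $val = $node->{$key};
            if ($key ne 'td') {
                collect_fields($val, $rec);    # keep descending
                next;
            }
            # 'td' may be a single hash or an array of hashes
            for my $td (ref($val) eq 'ARRAY' ? @$val : $val) {
                next unless ref($td) eq 'HASH';
                for my $name (grep { $_ ne 'content' } keys %$td) {
                    my $field = $td->{$name};
                    if (!ref($field)) {
                        $rec->{$name} = $field;      # plain value
                    }
                    elsif (ref($field) eq 'HASH' && !%$field) {
                        $rec->{$name} = '';          # empty element
                    }
                    # anything else (e.g. 'a' links) is skipped
                }
            }
        }
    }
    return $rec;
}

# Trimmed sample in the shape of the dump above (values are illustrative).
my $response = {
    body => {
        table => [
            { tr => { td => { ReturnStatus => 'Completed', content => "Return Status:\x{a0}" } } },
            { tr => [
                { td => { StateInfo => '12345', content => "StateInfo:\x{a0}" } },
                { td => { ApartmentNum => {}, content => "Apartment Num:\x{a0}" } },
                { td => { LastName => 'SMITH', content => "LastName:\x{a0}" } },
                { td => { a => { href => 'loginform.html', content => 'Change Login' } } },
            ] },
        ],
    },
};

my %rec;
collect_fields($response, \%rec);

print "$rec{LastName}\n";
print "rec is:", Dumper(\%rec), "\n";    # pass a reference to %rec
```

        A side note on the test code: %rec and $rec are different variables, so Dumper($rec) prints undef no matter what the loop stored; Dumper(\%rec) dumps the hash. With "use strict" in place, the undeclared $rec is reported at compile time.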