hoopsbwc34 has asked for the wisdom of the Perl Monks concerning the following question:

I've gotten a lot of help with removing white space, etc. in my extracted tables by searching here, but I'm having one problem I haven't seen a solution posted for yet. I'm extracting a table where one of the headers I want to keep is embedded in another table, i.e. the headers look like this when extracted: header 1,,header 3, header 4 ...etc. That's because header 2 is in another table. Actually, in that part of the table there are two additional tables (the first formatting the name I want, the second a search bar). I have no idea how to pull this data out (other than just going to a 'C'-like for loop and pulling out $row2,$row4 from the full table, maybe?). Is there a way to specify a header location instead of a name? Can anyone help? Here's my code:
#!/usr/bin/perl
use strict;
use HTML::TableExtract;
use LWP::UserAgent;
use Data::Dumper;
$Data::Dumper::Useqq = 1;

my $url = 'http://some.foo.com';
my $ua  = LWP::UserAgent->new;
my $te  = HTML::TableExtract->new(
    headers => ['Header 2', 'Header 4'],
    depth   => 0,
    count   => 3,
);

# Feed each chunk of the response straight into the parser.
my $res = $ua->request(HTTP::Request->new(GET => $url),
                       sub { $te->parse($_[0]) });

foreach my $ts ($te->table_states) {
    print "Table (", join(',', $ts->coords), "):\n";
    foreach my $row ($ts->rows) {
        foreach (@$row) {
            tr{\t\r\n\xA0}{ }s;   # collapse tabs, newlines and non-breaking spaces
            s{^\s+}{};
            s{\s+$}{};
        }
        # print Dumper $row;
        print join(',', @$row), "\n";
    }
}

Replies are listed 'Best First'.
Re: Embedded Table Headers with HTML::TableExtract
by bobn (Chaplain) on Jul 12, 2003 at 23:13 UTC

    Since you don't provide a sample of the HTML you are attempting to parse, it's pretty much impossible to know what's going on. I think you mean that you have one table contained within another table, and it is the inner table that interests you.

    Assuming that is the case, and looking at the doc for HTML::TableExtract, it appears that you can pass it a string with $te->parse($html), so I'd parse the outer table and, upon finding the element which is the contained table, hand that text to another properly defined instance of HTML::TableExtract->parse().

    Update: After playing around, I observe the following:

    • Using depth => 0 causes only the outermost table in each set of nested tables to be parsed.
    • Including headers from different levels causes nothing to be returned to me.
    Using depth => <some_number> would let you pick out the embedded table you want.
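    To make the depth/count idea concrete, here is a minimal sketch rather than a definitive recipe. The inline HTML is a made-up stand-in for the real page, and the code assumes a 2.x release of HTML::TableExtract, which provides the tables method (the code elsewhere in this thread uses the older table_states name).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical stand-in for the real page: one header cell holds a nested table.
my $html = <<'HTML';
<table>
  <tr>
    <td>Header 1</td>
    <td><table><tr><td><b>Header 2</b></td><td>Update this page</td></tr></table></td>
    <td>Header 3</td>
    <td>Header 4</td>
  </tr>
  <tr><td>a</td><td>b</td><td>c</td><td>d</td></tr>
</table>
HTML

# depth => 1 selects tables nested one level down;
# count => 0 selects the first table found at that depth.
my $te = HTML::TableExtract->new(depth => 1, count => 0);
$te->parse($html);
$te->eof;

my @nested_rows;
foreach my $table ($te->tables) {
    foreach my $row ($table->rows) {
        # Cells may be undef or padded with whitespace, so normalise them.
        my @cells = map { defined $_ ? $_ : '' } @$row;
        s/^\s+|\s+$//g for @cells;
        push @nested_rows, \@cells;
    }
}

print join(',', @$_), "\n" for @nested_rows;
```

    Note that this only reaches the nested table on its own; it does not tie the nested header back to the columns of the outer table.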

    HTML::TableExtract is pretty cool.

    --Bob Niederman, http://bob-n.com
      Yeah, I knew that might be an issue. But the data isn't accessible through the web (it's on an internal server behind a firewall for work). I've tried to recreate the problem in the HTML below.

      What I am trying to pull out is that "Header 2" that is actually in the "Header 1" position of the next deeper table. Changing the depth with TableExtract will pull out a deeper table, but it doesn't link it with the previous table. The documentation for this module talks about chaining tables, but as I understand it, you are then chaining the table for ALL cases (i.e. not just the Header 2 situation, but Header 1,2,3,4 would then be expected to be in a deeper table).

      Seems like a case this module probably didn't expect and wouldn't be expected to work properly with; however, I'm wondering if there is a way I can still get the functionality out.... thanks!

      <table>
        <tr>
          <td>Header 1</td>
          <td>
            <table>
              <tr><form action="search.php" name="searchsimple" method="post">
                <td><b>Header 2</b> </td>
                <td> <a href="thispage.php"><font size="1" color="white">Update this page</font></a> </td>
                <td>
                  <table>
                    <tr>
                      <td><b>Search the web:</b></font> </td>
                      <td>
                        <input type="text" name=searchtext size=10>
                        <input type="submit" value="Search!">
                      </td>
                    </tr>
                  </table>
                </td>
              </tr></form>
            </table>
          </td>
          <td> Header 3</td>
          <td> Header 4</td>
        </tr>
        <tr>... data </tr>
        <tr>... data </tr>
        <tr>... data </tr>
      </table>
Re: Embedded Table Headers with HTML::TableExtract
by hoopsbwc34 (Initiate) on Jul 15, 2003 at 17:30 UTC
    The solution I came up with was to not use the headers input for HTML::TableExtract, and then in the foreach loop call out the column numbers I wanted. This also required me to skip the first line (as that is the header information), so I added a counter for that. It's a simple solution, but I'd still be interested in knowing if there is a way to do this through the module:
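    foreach my $ts ($te->table_states) {
        print "Table (", join(',', $ts->coords), "):\n";
        my $i = 0;
        foreach my $row ($ts->rows) {
            $i++;
            next if $i == 1;    # skip the first row, which holds the headers
            foreach (@$row) {
                tr{\t\r\n\xA0}{ }s;
                s{^\s+}{};
                s{\s+$}{};
            }
            # Pick the wanted columns by position rather than by header name.
            print "$row->[2] $row->[4]\n";
        }
    }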
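    The same select-by-position idea can be sketched on plain row data, independent of the module. Here pick_columns is a hypothetical helper (not part of HTML::TableExtract), and the rows are made-up values standing in for what $ts->rows would return.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# pick_columns is a hypothetical helper, not part of HTML::TableExtract.
# It drops the first row (the headers) and keeps the cells at the given
# zero-based positions from every remaining row.
sub pick_columns {
    my ($rows, @positions) = @_;
    my ($header, @data) = @$rows;
    my @picked;
    foreach my $row (@data) {
        # Missing cells become empty strings so joins never see undef.
        push @picked, [ map { defined $row->[$_] ? $row->[$_] : '' } @positions ];
    }
    return \@picked;
}

# Made-up rows; the undef mimics the header cell that holds a nested table.
my $rows = [
    [ 'Header 1', undef, 'Header 3', 'Header 4' ],
    [ 'a1', 'b1', 'c1', 'd1' ],
    [ 'a2', 'b2', 'c2', 'd2' ],
];

my $picked = pick_columns($rows, 1, 3);
print join(' ', @$_), "\n" for @$picked;
```

    Keeping the positions as arguments means the column numbers live in one place if the page layout changes.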
Re: Embedded Table Headers with HTML::TableExtract
by mojotoad (Monsignor) on Jul 15, 2003 at 19:31 UTC
    There's currently no way to "flatten" tables into other tables for purposes of header identification. By the time you nailed down the coordinates of tables you wished to flatten, you might as well have been using depth/count coordinates to begin with.

    One feature the module lacks is tracking downward ownership of tables (i.e., what subtables does a particular table have). You can do this in the opposite direction, going back up the tree, but not downward. Such information could be useful in cases like this.

    For the time being, however, your solution is about as good as it gets for your particular case.

    Matt