monkfan has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am trying to parse an HTML file with HTML::TableExtract. The main aim is to capture the final row (the one that contains "Total") into an array reference. Why doesn't my code below do the job?
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
use Carp;
use HTML::TableExtract;

my $temp_file = do {
    open my $in, '<', 'myfile.html' or carp "Can't open in $!\n";
    local $/ = undef;
    <$in>;
};

#--------------------------------------------------
# Extract Element of HTML Table
#--------------------------------------------------
#print Dumper $temp_file ;
( my $id ) = $temp_file =~ /([\w]+\.[\w\d]+)/ms;
print "$id\n";

my $te = HTML::TableExtract->new(
    headers => [
        'Data set','nTP','nFP',
        'nFN','nTN','sTP',
        'sFP','sFN',' ','nSn',
        'nPPV','nSp','nPC',
        'nCC','sSn','sPPV',
        'sASP',
    ]
);
$te->parse($temp_file);

my @all_table_content = $te->tables;

# Here to extract the 'last' row
my @total = @{ $all_table_content[0]->[-1] };
print Dumper \@all_table_content ;
The HTML file (myfile.html) that I want to parse and obtain the TOTAL result looks like this:
<html>
<head>
<title> scrPage </title>
</head>
<!-- -->
<!-- jsp:setProperty name="manager" property="*" /-->
<body bgcolor="#ffffff">
<h1> Assessment Score </h1>
<b> Here is your confirmation ID: SP.A91389F67D1C79B4157818A8EDF2A6C2 </b>
<br>
<form method="get" action="http://wingless.cs.washington.edu:8080/assessment/servlet">
<input type="hidden" value="submission/SP.A91389F67D1C79B4157818A8EDF2A6C2" name="filenameID"/>
<input type="hidden" name="pageType" value="visualizationForm"/>
<br>
<INPUT TYPE=submit name="action" value="Visualize It">
<input type=submit name="action" value="Get Excel Spreadsheet"/>
<a href=http://bio.cs.washington.edu/assessment/statistics.html>statistics explanation
</form>
<Table border = 3>
<tr><th>Data set<td>nTP<td>nFP<td>nFN<td>nTN<td>sTP<td>sFP<td>sFN<td> <td>nSn<td>nPPV<td>nSp<td>nPC<td>nCC<td>sSn<td>sPPV<td>sASP
<tr><th>dm01g<td>0<td>80<td>125<td>5795<td>0<td>8<td>7<td> <td>0<td>0<td>0.986383<td>0<td>-0.0169565<td>0<td>0<td>0
<tr><th>
<tr><th>Fly <td>0<td>80<td>125<td>5795<td>0<td>8<td>7<td> <td>0<td>0<td>0.986383<td>0<td>-0.0169565<td>0<td>0<td>0
<tr><th>Human <td>0<td>0<td>0<td>0<td>0<td>0<td>0<td> <td>NaN<td>NaN<td>NaN<td>NaN<td>NaN<td>NaN<td>NaN<td>NaN
<tr><th>Mouse <td>0<td>0<td>0<td>0<td>0<td>0<td>0<td> <td>NaN<td>NaN<td>NaN<td>NaN<td>NaN<td>NaN<td>NaN<td>NaN
<tr><th>Yeast <td>0<td>0<td>0<td>0<td>0<td>0<td>0<td> <td>NaN<td>NaN<td>NaN<td>NaN<td>NaN<td>NaN<td>NaN<td>NaN
<tr><th>Total <td>0<td>80<td>125<td>5795<td>0<td>8<td>7<td> <td>0<td>0<td>0.986383<td>0<td>-0.0169565<td>0<td>0<td>0
</table>
</body>
</html>

Regards,
Edward

Replies are listed 'Best First'.
Re: Problem Parsing with HTML::TableExtract
by Fang (Pilgrim) on Dec 07, 2005 at 09:18 UTC

    HTML::TableExtract parses that file perfectly (as always). To verify how the module sees your data, you should always try a little snippet like the one in the synopsis of the module.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TableExtract;

    die 'Missing argument!' unless (@ARGV);
    my $file = shift @ARGV;

    my $te = HTML::TableExtract->new();
    $te->parse_file($file);

    for my $table ($te->tables) {
        print 'Table (', join(', ', $table->coords), "):\n";
        my $rownumber = 1;
        for my $row ($table->rows) {
            print "Row $rownumber: ", join(', ', @$row), "\n";
            $rownumber++;
        }
    }

    Your error comes from the line my @total = @{ $all_table_content[0]->[-1] };. You probably got the message Not an ARRAY reference at bin/perl/tableextract.pl line 38, as I did. As the code above shows, you need to use the rows method. The trick is that if you simply write @{ $all_table_content[0]->rows->[-1] }, perl evaluates everything up to rows in scalar context, and strict will stop you from dereferencing the result (and if you remove strict, it won't work anyway). So you have to force list context and take the last element of that list before dereferencing it. In perl, this means:

    my @total = @{ [ $all_table_content[0]->rows ]->[-1] };
    print join(', ', @total), "\n";

    OK, so that's butt-ugly to me, and to any sane person I assume, but it works. Another, more readable solution could be:

    my $table = ($te->tables)[0];
    my $count = $table->rows;
    my $last_row = $table->row($count - 1);
    print join(', ', @$last_row), "\n";
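
    The context issue described above can be tried without the module at all. The following is only a sketch: fake_rows is a hypothetical stand-in for $table->rows, assumed (as this reply describes) to return a plain list of array references, so that scalar context yields a count rather than a reference.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical stand-in for $table->rows: returns a list of array refs.
    sub fake_rows {
        my @rows = ( [ 'Fly', 0, 80 ], [ 'Human', 0, 0 ], [ 'Total', 0, 80 ] );
        return @rows;
    }

    # Scalar context: returning an array from a sub yields its element count,
    # which is why ->[-1] on that value cannot work under strict refs.
    my $count = fake_rows();

    # List context, option 1: wrap the list in an anonymous array, then index.
    my @last_a = @{ [ fake_rows() ]->[-1] };

    # List context, option 2: a list slice, which avoids the extra brackets.
    my @last_b = @{ ( fake_rows() )[-1] };

    print "count: $count\n";
    print join(', ', @last_a), "\n";
    print join(', ', @last_b), "\n";
    ```

    Both options pick the same last row; the list slice is arguably a little easier on the eyes than the anonymous-array form.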
Re: Problem Parsing with HTML::TableExtract
by johnnywang (Priest) on Dec 07, 2005 at 06:32 UTC
    Try (untested):
    my @all_tables = $te->table_states;
    my @all_rows = @{ $all_tables[0]->rows };
    print Dumper $all_rows[-1];
    Updated: I meant table_states().
      Hi,
      Thanks for the reply. But:
      $te->table_stats;
      Doesn't seem to exist as a method in the module.
      Then, I tried:
      my @all_tables = $te->table_states(0,0);
      Also doesn't do the job.

      Regards,
      Edward
Re: Problem Parsing with HTML::TableExtract
by kulls (Hermit) on Dec 07, 2005 at 06:48 UTC
    Hi,
    In your myfile.html, I didn't find any closing tags for td, tr and th. Maybe this causes the issue. I guess HTML::TableExtract parses based on those.
    -kulls
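
    For what it's worth, the first reply reports that the module parses the posted file fine, which suggests the missing closing tags are not the culprit. A minimal sketch to check this yourself follows. The tiny table and the helper name last_row_of_first_table are made up for illustration, and it assumes a 2.x version of HTML::TableExtract in which tables returns table objects with a rows method.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TableExtract;

    # Made-up sample table: no </th>, </td> or </tr> anywhere.
    my $html = join '',
        '<table>',
        '<tr><th>Data set<td>nTP',
        '<tr><th>Fly<td>0',
        '<tr><th>Total<td>80',
        '</table>';

    # Hypothetical helper: return the last row (an array ref) of the first
    # table found in the markup, or undef if no table was found.
    sub last_row_of_first_table {
        my ($markup) = @_;
        my $te = HTML::TableExtract->new();   # no headers given: match every table
        $te->parse($markup);
        $te->eof;                             # signal that the input is complete
        my ($table) = $te->tables;
        return unless $table;
        return ( $table->rows )[-1];          # list slice forces list context
    }

    my $last = last_row_of_first_table($html);
    my $text = $last ? join(', ', @$last) : 'no table found';
    print "$text\n";
    ```

    If the last row comes back with the Total cells, the optional closing tags are being handled and the problem lies elsewhere in the script.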