Problem Parsing with HTML::TableExtract

monkfan has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am trying to parse an HTML file with HTML::TableExtract. The main aim is to capture the final row (the one that contains "Total") into an array reference. Why doesn't my code below do the job?

#!/usr/bin/perl -w
use strict;
use Data::Dumper;
use Carp;
use HTML::TableExtract;

my $temp_file = do {
    open my $in, '<', 'myfile.html'
        or carp "Can't open in $!\n";
    local $/ = undef;
    <$in>;
};

#--------------------------------------------------
# Extract Element of HTML Table
#--------------------------------------------------

#print Dumper $temp_file ;
( my $id ) = $temp_file =~ /([\w]+\.[\w\d]+)/ms;
print "$id\n";

my $te = HTML::TableExtract->new(
    headers => [
            'Data set','nTP','nFP',
            'nFN','nTN','sTP',
            'sFP','sFN',' ','nSn',
            'nPPV','nSp','nPC',
            'nCC','sSn','sPPV',
            'sASP',
    ]
);


$te->parse($temp_file);
my @all_table_content = $te->tables;

# Here to extract the 'last' row
my @total             = @{ $all_table_content[0]->[-1] };
print Dumper \@all_table_content ;

The HTML file (myfile.html) that I want to parse and obtain the TOTAL result looks like this:

<html>
<head>
<title>
scrPage
</title>
</head>
<!--

-->

<!-- jsp:setProperty name="manager" property="*" /-->
<body bgcolor="#ffffff">

<h1>
Assessment Score
</h1>
<b>
Here is your confirmation ID: SP.A91389F67D1C79B4157818A8EDF2A6C2
</b>

<br>
<form method="get" action="http://wingless.cs.washington.edu:8080/assessment/servlet">
<input type="hidden" value="submission/SP.A91389F67D1C79B4157818A8EDF2A6C2" name="filenameID"/>
<input type="hidden" name="pageType" value="visualizationForm"/>
<br>
<INPUT TYPE=submit name="action" value="Visualize It">
<input type=submit name="action" value="Get Excel Spreadsheet"/>
<a href=http://bio.cs.washington.edu/assessment/statistics.html>statistics explanation
</form>

<Table border = 3>
<tr><th>Data set<td>nTP<td>nFP<td>nFN<td>nTN<td>sTP<td>sFP<td>sFN<td> <td>nSn<td>nPPV<td>nSp<td>nPC<td>nCC<td>sSn<td>sPPV<td>sASP<tr><th>dm01g<td>0<td>80<td>125<td>5795<td>0<td>8<td>7<td> <td>0<td>0<td>0.986383<td>0<td>-0.0169565<td>0<td>0<td>0

<tr><th> 
<tr><th>Fly
<td>0<td>80<td>125<td>5795<td>0<td>8<td>7<td> <td>0<td>0<td>0.986383<td>0<td>-0.0169565<td>0<td>0<td>0

<tr><th>Human
<td>0<td>0<td>0<td>0<td>0<td>0<td>0<td> <td>NaN<td>NaN<td>NaN<td>NaN<td>NaN<td>NaN<td>NaN<td>NaN

<tr><th>Mouse
<td>0<td>0<td>0<td>0<td>0<td>0<td>0<td> <td>NaN<td>NaN<td>NaN<td>NaN<td>NaN<td>NaN<td>NaN<td>NaN

<tr><th>Yeast
<td>0<td>0<td>0<td>0<td>0<td>0<td>0<td> <td>NaN<td>NaN<td>NaN<td>NaN<td>NaN<td>NaN<td>NaN<td>NaN

<tr><th>Total
<td>0<td>80<td>125<td>5795<td>0<td>8<td>7<td> <td>0<td>0<td>0.986383<td>0<td>-0.0169565<td>0<td>0<td>0

</table>

</body>
</html>

Regards,
Edward


Replies are listed 'Best First'.
Re: Problem Parsing with HTML::TableExtract by Fang (Pilgrim) on Dec 07, 2005 at 09:18 UTC
HTML::TableExtract parses that file perfectly (as always). To verify how the module sees your data, you should always try a little snippet like the one in the synopsis of the module:

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

die 'Missing argument!' unless (@ARGV);
my $file = shift @ARGV;

my $te = HTML::TableExtract->new();
$te->parse_file($file);

for my $table ($te->tables) {
    print 'Table (', join(', ', $table->coords), "):\n";
    my $rownumber = 1;
    for my $row ($table->rows) {
        print "Row $rownumber: ", join(', ', @$row), "\n";
        $rownumber++;
    }
}

Your error comes from this line:

my @total = @{ $all_table_content[0]->[-1] };

You probably got the message `Not an ARRAY reference at bin/perl/tableextract.pl line 38`, as I did. As you can see from the code above, you need to use the `rows` method. But the trick is that if you simply write `@{ $all_table_content[0]->rows->[-1] }`, perl will evaluate the part up to `rows` in scalar context, and strict will stop you from dereferencing the result (and if you remove strict, it won't work anyway). So you have to force list context, and take the last element of that list before dereferencing it. In perl, this means:

my @total = @{ [ $all_table_content[0]->rows ]->[-1] };
print join(', ', @total), "\n";

OK, so that's butt-ugly to me, and to any sane person I assume, but it works. Another, more readable solution could be:

my $table    = ($te->tables)[0];
my $count    = $table->rows;
my $last_row = $table->row($count - 1);
print join(', ', @$last_row), "\n";
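The scalar-versus-list context point can be reproduced without the module. What follows is only a minimal sketch in core Perl: the `rows` sub and the `@grid` data are hypothetical stand-ins made up for illustration, not HTML::TableExtract's own method or real output. It shows a list-returning sub yielding a count in scalar context, the anonymous-array trick, and a list slice as an alternative way to pick the last element.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a method that returns a list of row
# references (this is NOT the real HTML::TableExtract rows method).
my @grid = ( [ 'Fly', 0, 80 ], [ 'Human', 0, 0 ], [ 'Total', 0, 80 ] );
sub rows { return @grid }

# Scalar context: returning an array from a sub yields its element count.
my $count = rows();

# List context forced by the anonymous array constructor [ ... ];
# the last element of that array is then dereferenced.
my @last_a = @{ [ rows() ]->[-1] };

# Same result with a list slice, which avoids the temporary array ref.
my @last_b = @{ ( rows() )[-1] };

print "count:  $count\n";
print "last a: ", join(', ', @last_a), "\n";
print "last b: ", join(', ', @last_b), "\n";
```

Whether the slice form reads better than the anonymous-array form is a matter of taste; both rely on calling the sub in list context before indexing.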
Re: Problem Parsing with HTML::TableExtract by johnnywang (Priest) on Dec 07, 2005 at 06:32 UTC
Try (untested):

my @all_tables = $te->table_states;
my @all_rows = @{ $all_tables[0]->rows };
print Dumper $all_rows[-1];

Updated: meant table_states().
Re^2: Problem Parsing with HTML::TableExtract by monkfan (Curate) on Dec 07, 2005 at 07:11 UTC
Hi, thanks for the reply. But

$te->table_stats;

doesn't seem to exist as a method in the module. Then I tried:

my @all_tables = $te->table_states(0,0);

That doesn't do the job either.

Regards,
Edward
Re: Problem Parsing with HTML::TableExtract by kulls (Hermit) on Dec 07, 2005 at 06:48 UTC
Hi, in your `myfile.html` I didn't find any closing tags for `td`, `tr` and `th`. Maybe this is causing the issue. I guess `HTML::TableExtract` parses based on those.

-kulls