Using HTML::Parser extract text from tables

zzspectrez has asked for the wisdom of the Perl Monks concerning the following question:

Here is my first attempt at using HTML::Parser. Up to now I have been using Perl's pattern matching abilities to extract the data I need from HTML files. I know this is bad.

So I have been trying to figure out how to get HTML::Parser to work for my needs. I wrote a Perl script that downloads my bank information from my bank's secure site using LWP. Now I want to extract just the account information. The layout of the page makes use of multiple layers of nested tables, and I wasn't sure of the best way to do this. I located HTML::TableExtract on CPAN, which should do what I need. Looking over its docs, it seems more useful for situations where the tables have headers, which these do not. I need to be able to get the text from a specific table, row, and column, and I don't think this module does that. So I made the following, which works. I would like suggestions on how I could improve it. Here is a stripped-down version of what I'm doing. To get at the data, do something like $table[12][1][2], which is the text in row 1, column 2 of the 12th table in the HTML file. Indexes are based off 1, not 0.

#!/usr/local/bin/perl -w
use strict;
use HTML::Parser;

my  @table;
my  @save;
my  $count    = 0;
my  $row      = 0;
my  $column   = 0;
my  $in_table = 0;

my  $p = HTML::Parser->new( api_version     => 3,
                            handlers        => [ start => [\&_start, "tagname, attr"],
                                                 end   => [\&_end,   "tagname"],
                                                 text  => [\&_text,  "dtext"],
                                               ],
                            marked_sections => 1,
                          );

$p->parse_file('test.html');

sub _start {
  my ($tag, $attr) = @_;   # list assignment from @_; "= shift" would leave $attr undefined
  if ($tag eq 'table'){
    push @save, [$row,$column];
    $row = $column = 0;
    ++$count;
    $in_table++;
  }
  $row++    if ($tag eq 'tr');
  $column++ if ($tag eq 'td');  
}

sub _end {
  my ($tag) = @_;          # the end handler is only passed "tagname"
  if ($tag eq 'table') {
    ($row, $column) = @{ pop @save };
    --$in_table;
  }
  $column = 0 if ($tag eq 'tr');
}

sub _text {
  my $text = shift;
  chomp $text;
  $text =~ s/\xa0//g; # for some reason the data has a bunch of \xA0 characters (&nbsp;?)
  return unless length $text;   # a plain truth test would also drop a cell containing "0"
  $table[$count][$row][$column] .= $text if ($in_table) && ($text !~ m/^\s+$/);
}

## print data
print 'ACCOUNT:   ',$table[12][1][2], "\n";
print 'BALANCE:   ',$table[12][1][3], "\n";
print 'AVAILABLE: ',$table[12][1][4], "\n";

Thanks!
zzSPECTREz
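
On the \xA0 question in the code comment: with the "dtext" argspec, HTML::Parser passes the handler text with entities already decoded, and &nbsp; decodes to character 0xA0, which would explain the stray characters. Below is a minimal sketch of a cleanup helper, assuming it is acceptable to treat non-breaking spaces as ordinary whitespace. The name clean_cell_text is made up for illustration and is not part of the script above. It replaces every \xA0 rather than only the first one, and the caller tests the length of the result, because a cell containing just 0 is false in Perl.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): normalise the
# text of one table cell.
sub clean_cell_text {
    my ($text) = @_;
    $text =~ s/\xa0/ /g;        # every non-breaking space becomes a plain space
    $text =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    return $text;
}

for my $raw ("\xa0\xa01,234.56\xa0\n", "0", "\xa0 \n") {
    my $text = clean_cell_text($raw);
    # Test the length, not the truth value: the string "0" is false.
    print( length($text) ? "keep [$text]\n" : "skip empty cell\n" );
}
```

The explicit substitution is there because \s does not reliably match \xA0 in byte strings, so trimming alone may leave it in place.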


Replies are listed 'Best First'.
Re: Using HTML::Parser extract text from tables by OeufMayo (Curate) on Jan 16, 2001 at 14:07 UTC
"To get at the data do something like $table[12][1][2] which is the text in row 1, column 2 of the 12th table in the html file. Indexes are based off 1 not 0."

You probably mean the second row of the third column of the 13th table, or did you mess with the $[ variable? :)

-- PerlMonger::Paris(http => 'paris.pm.org');
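
For what it is worth, the script can use 1-based indexes without touching $[, because each counter is incremented when its tag opens and only then used as an index, so slot 0 at every level is simply never filled. A small sketch of that effect, with made-up cell text and a hypothetical build_example helper:

```perl
use strict;
use warnings;

# Store two cells the way the original handlers do: each counter is
# incremented when its tag opens, before it is used as an index.
sub build_example {
    my @table;
    my ($count, $row, $column) = (0, 0, 0);
    ++$count;                               # <table>
    $row++;                                 # <tr>
    for my $text ('first cell', 'second cell') {
        $column++;                          # <td>
        $table[$count][$row][$column] = $text;
    }
    return \@table;
}

my $table = build_example();
print( defined($table->[0]) ? "slot 0 used\n" : "slot 0 unused\n" );
print "$table->[1][1][2]\n";
```

The array itself is still an ordinary 0-based Perl array; element 0 just stays undefined.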
Re: Using HTML::Parser extract text from tables by goldclaw (Scribe) on Jan 16, 2001 at 17:44 UTC
You might want to save the current table number in @save as well, to be able to handle tables that have both other tables and text inside. To do this, you could apply these changes:

At the top:

my $tablenr=0;

In _start:

if ($tag eq 'table'){
  push @save, [$tablenr,$row,$column];
  $row = $column = 0;
  $tablenr=$count;
  ++$count;
  $in_table++;
}

In _end:

if ($tag eq 'table') {
  ($tablenr, $row, $column) = @{ pop @save };
  --$in_table;
}

In _text:

$table[$tablenr][$row][$column] .= $text if ($in_table) && ($text !~ m/^\s+$/);

You might want to add an initialization of @save as well, to avoid trying to dereference an undefined value when you leave a toplevel table. Something like this perhaps:

my @save=([]); #initialize with an empty list as first element.

Regards, GoldClaw
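
Putting the suggested changes together, a self-contained sketch might look like the following. It is only one possible arrangement, not the poster's tested code: the wrapper name extract_cells, the use of closures instead of package-level variables, and the sample HTML string are all assumptions made for illustration. Because $tablenr is set from $count before the increment, tables are numbered from 0 here, while rows and columns still start at 1.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical wrapper: parse an HTML string and return a structure
# indexed as $cells->[$table][$row][$column].
sub extract_cells {
    my ($html) = @_;
    my (@table, @save);
    my ($count, $tablenr, $row, $column, $in_table) = (0, 0, 0, 0, 0);

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag) = @_;
            if ($tag eq 'table') {
                push @save, [$tablenr, $row, $column];  # remember the enclosing cell
                $row = $column = 0;
                $tablenr = $count;                      # this table's own number
                ++$count;
                $in_table++;
            }
            $row++    if $tag eq 'tr';
            $column++ if $tag eq 'td';
        }, 'tagname' ],
        end_h => [ sub {
            my ($tag) = @_;
            if ($tag eq 'table' && @save) {             # ignore a stray </table>
                ($tablenr, $row, $column) = @{ pop @save };
                --$in_table;
            }
            $column = 0 if $tag eq 'tr';
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            return unless $in_table && $text =~ /\S/;   # skip whitespace-only text
            $table[$tablenr][$row][$column] .= $text;
        }, 'dtext' ],
    );
    $p->parse($html);
    $p->eof;
    return \@table;
}

my $cells = extract_cells(
    '<table><tr><td><table><tr><td>inner</td></tr></table>outer</td></tr></table>'
);
print "$cells->[0][1][1]\n";    # text that follows the nested table
print "$cells->[1][1][1]\n";    # text inside the nested table
```

With the sample string, the text after the nested table is filed under the outer table's cell rather than under the inner table's number, which is the case the extra stack entry is meant to handle.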
Re: Re: Using HTML::Parser extract text from tables by zzspectrez (Hermit) on Jan 18, 2001 at 12:07 UTC
Giving thought to the problem of nested subs I rewrote it as a module and fixed the error. See the new module. And the following is a test script using it. Just enter the table,row,col and it will print out the text. Type quit when done. If anyone else has any suggestions or improvements for the module I would be glad to hear them.

#!/usr/local/bin/perl -w
use strict;
use Table;

my $table = Table->new;
my $content = join '', ( <DATA> );
$table->parse_it(\$content);

print "INPUT TABLE,ROW,COL: ";
while (my $inp = <STDIN>){
  chomp $inp;
  last if $inp eq 'quit';
  my ($x,$y,$z) = split ',', $inp;
  next unless ($x) && ($y) && ($z);
  print $table->[$x][$y][$z],"\n" if $table->[$x][$y][$z];
  print "INPUT TABLE,ROW,COL: ";
}
__END__
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
<head><title>tester.html</title></head>
<body>
<h1>tester.html</h1>
<TABLE>
<TR><TD>TABLE 1:ROW 1:COL1</TD><TD>TABLE 1:ROW1:COL2</TD></TR>
<TR><TD><TABLE><TR><TD>TABLE 2:ROW1:COL1</TD></TR></TABLE></TD><TD>TABLE 1:ROW2:COL2</TD></TR>
<TR><TD><TABLE><TR><TD><TABLE><TR><TD>TABLE4:ROW1:COL1</TD><TD>TABLE4:ROW1:COL2</TD></TR></TABLE>TABLE3:ROW1:COL1</TD></TR></TABLE></TD></TR>
</TABLE>
<TABLE>
<TR><TD>TABLE5:ROW1:COL1</TD><TD>TABLE5:ROW1:COL2</TD></TR>
<TR><TD>TABLE5:ROW2:COL1</TD></TR>
</TABLE>
<hr>
</body>
</html>
Re: Re: Using HTML::Parser extract text from tables by zzspectrez (Hermit) on Jan 17, 2001 at 06:28 UTC
"You might want to add an initialization of @save as well, to avoid trying to dereference an undefined value when you leave a toplevel table. Something like this perhaps:"

Actually, since @save is only accessed on an end tag, a begin tag must already have pushed a value onto it. In the case of the first table, the values 0,0 are pushed onto @save and then popped when that table ends. So I don't think I need to worry about dereferencing an undefined value.

I don't think pushing $tablenr will fix the problem, because when the old value is popped it will think that on the next table it should $tablenr++, which will overwrite the previous data. I will have to do something else.
Re: Re: Re: Using HTML::Parser extract text from tables by Anonymous Monk on Jan 17, 2001 at 15:45 UTC
You are right about that @save and undefined values thing. My brain must have fallen asleep there for a while.

Pushing $tablenr will fix the recursive table case, though. You still increase $count each time you encounter a table start. You also set $tablenr to $count. It's only on table end that you restore $tablenr, but you do _not_ restore $count. Hence, each table is given a unique, increasing number.

Have you tried it, btw? I haven't, so I'm only speaking theoretically here...

Regards, GoldClaw
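
The argument can be checked without any HTML at all by replaying only the table open and close events. The sketch below is an illustration of the counter logic described in this reply, not code from the thread; the function name trace_table_numbers and the event list are made up.

```perl
use strict;
use warnings;

# Replay a sequence of <table> open/close events and report which table
# number text would be filed under after each event.
sub trace_table_numbers {
    my @events = @_;
    my ($count, $tablenr) = (0, 0);
    my (@save, @current);
    for my $event (@events) {
        if ($event eq 'open') {
            push @save, $tablenr;   # remember the enclosing table
            $tablenr = $count;      # this table takes the next free number
            ++$count;               # never decremented, so numbers stay unique
        }
        else {
            $tablenr = pop @save;   # back to the enclosing table
        }
        push @current, $tablenr;
    }
    return @current;
}

# An outer table containing two sibling tables, then a second top-level table.
print join(' ', trace_table_numbers(qw(open open close open close close open close))), "\n";
# prints: 0 1 0 2 0 0 3 0
```

Each newly opened table gets a number that has not been used before (0, 1, 2, 3), and closing a nested table returns to the enclosing table's number instead of reusing or overwriting it.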