zzspectrez has asked for the wisdom of the Perl Monks concerning the following question:

Here is my first attempt at using HTML::Parser. Up to now I have been using Perl's pattern-matching abilities to extract the data I need from HTML files. I know this is bad.

So I have been trying to figure out how to get HTML::Parser to work for my needs. I wrote a Perl script that downloads my bank information from my bank's secure site using LWP. Now I want to extract just the account information. The layout of the page makes use of several layers of tables, and I wasn't sure of the best way to handle that. I located HTML::TableExtract on CPAN, which should do what I need, but looking over its docs it seems more useful for tables that have headers, and these have none. I need to be able to get the text from a specific table, row, and column, and I don't think this module does that. So I wrote the following, which works. I would like suggestions on how I could improve it. Here is a stripped-down version of what I'm doing. To get at the data, use something like $table[12][1][2], which is the text in row 1, column 2 of the 12th table in the HTML file. Indexes start at 1, not 0.

#!/usr/local/bin/perl -w
use strict;
use HTML::Parser;

my @table;
my @save;
my $count    = 0;
my $row      = 0;
my $column   = 0;
my $in_table = 0;

my $p = HTML::Parser->new(
    api_version => 3,
    handlers    => [
        start => [\&_start, "tagname, attr"],
        end   => [\&_end,   "tagname"],
        text  => [\&_text,  "dtext"],
    ],
    marked_sections => 1,
);
$p->parse_file('test.html');

sub _start {
    my ($tag, $attr) = @_;
    if ($tag eq 'table') {
        # remember where we were in the enclosing table
        push @save, [$row, $column];
        $row = $column = 0;
        ++$count;
        $in_table++;
    }
    $row++    if ($tag eq 'tr');
    $column++ if ($tag eq 'td');
}

sub _end {
    my ($tag) = @_;
    if ($tag eq 'table') {
        # restore the position in the enclosing table
        ($row, $column) = @{ pop @save };
        --$in_table;
    }
    $column = 0 if ($tag eq 'tr');
}

sub _text {
    my $text = shift;
    chomp $text;
    $text =~ s/\xa0//g;    # for some reason the data has a bunch of \xA0 characters (&nbsp;?)
    return unless $text;
    $table[$count][$row][$column] .= $text
        if ($in_table) && ($text !~ m/^\s+$/);
}

## print data
print 'ACCOUNT: ',   $table[12][1][2], "\n";
print 'BALANCE: ',   $table[12][1][3], "\n";
print 'AVAILABLE: ', $table[12][1][4], "\n";

Thanks!
zzSPECTREz

Replies are listed 'Best First'.
Re: Using HTML::Parser extract text from tables
by OeufMayo (Curate) on Jan 16, 2001 at 14:07 UTC

    To get at the data, use something like $table[12][1][2], which is the text in row 1, column 2 of the 12th table in the HTML file. Indexes start at 1, not 0.

    You probably mean the second row of the third column of the 13th table, or did you mess with the $[ variable? :)

    --
    PerlMonger::Paris(http => 'paris.pm.org');
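For what it's worth, the counters in the posted handlers are incremented before any text is stored, so as far as the code shown goes, the indexes do appear to start at 1 and slot 0 is simply never filled. A minimal sketch of that counter logic, with hand-written calls replacing the parser callbacks (start_tag and text are hypothetical stand-ins, not part of HTML::Parser):

```perl
use strict;
use warnings;

my @table;
my ($count, $row, $column) = (0, 0, 0);

# Simplified stand-ins for the posted _start and _text handlers
# (hypothetical names; the @save stack is left out for brevity).
sub start_tag {
    my ($tag) = @_;
    if ($tag eq 'table') { $row = $column = 0; ++$count }
    $row++    if $tag eq 'tr';
    $column++ if $tag eq 'td';
}
sub text { $table[$count][$row][$column] .= shift }

# Equivalent to <table><tr><td>first</td><td>second</td></tr></table>
start_tag('table');
start_tag('tr');
start_tag('td'); text('first');
start_tag('td'); text('second');

print "first cell:  \$table[1][1][1] = $table[1][1][1]\n";
print "second cell: \$table[1][1][2] = $table[1][1][2]\n";
print "slot 0 is ", (defined $table[0] ? "used" : "unused"), "\n";
```

Because every counter is bumped before the first store, element 0 at each level stays undefined rather than holding data.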
Re: Using HTML::Parser extract text from tables
by goldclaw (Scribe) on Jan 16, 2001 at 17:44 UTC
    You might want to save the current table number in @save as well, so you can handle tables that contain both other tables and text. To do this, you could apply these changes. At the top:
    my $tablenr=0;
    In _start:
    if ($tag eq 'table') {
        push @save, [$tablenr, $row, $column];
        $row = $column = 0;
        $tablenr = $count;
        ++$count;
        $in_table++;
    }
    In _end:
    if ($tag eq 'table') {
        ($tablenr, $row, $column) = @{ pop @save };
        --$in_table;
    }
    In _text:
    $table[$tablenr][$row][$column] .= $text if ($in_table) && ($text !~ m/^\s+$/);
    You might want to add an initialization of @save as well, to avoid trying to dereference an undefined value when you leave a top-level table. Something like this, perhaps:
    my @save=([]); #initialize with an empty list as first element.
    Regards, GoldClaw

      Giving thought to the problem of nested subs, I rewrote it as a module and fixed the error. See the new module. The following is a test script that uses it: enter table,row,col and it will print out the text. Type quit when done.

      If anyone else has any suggestions or improvements for the module I would be glad to hear them.

      #!/usr/local/bin/perl -w
      use strict;
      use Table;

      my $table = Table->new;
      my $content = join '', ( <DATA> );
      $table->parse_it(\$content);

      print "INPUT TABLE,ROW,COL: ";
      while (my $inp = <STDIN>){
          chomp $inp;
          last if $inp eq 'quit';
          my ($x,$y,$z) = split ',', $inp;
          next unless ($x) && ($y) && ($z);
          print $table->[$x][$y][$z],"\n" if $table->[$x][$y][$z];
          print "INPUT TABLE,ROW,COL: ";
      }

      __END__
      <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
      <html>
      <head><title>tester.html</title></head>
      <body>
      <h1>tester.html</h1>
      <TABLE>
      <TR><TD>TABLE 1:ROW 1:COL1</TD><TD>TABLE 1:ROW1:COL2</TD></TR>
      <TR><TD><TABLE><TR><TD>TABLE 2:ROW1:COL1</TD></TR></TABLE></TD><TD>TABLE 1:ROW2:COL2</TD></TR>
      <TR><TD><TABLE><TR><TD><TABLE><TR><TD>TABLE4:ROW1:COL1</TD><TD>TABLE4:ROW1:COL2</TD></TR></TABLE>TABLE3:ROW1:COL1</TD></TR></TABLE></TD></TR>
      </TABLE>
      <TABLE>
      <TR><TD>TABLE5:ROW1:COL1</TD><TD>TABLE5:ROW1:COL2</TD></TR>
      <TR><TD>TABLE5:ROW2:COL1</TD></TR>
      </TABLE>
      <hr>
      </body>
      </html>
      You might want to add an initialization of @save as well, to avoid trying to dereference an undefined value when you leave a top-level table. Something like this, perhaps:

      Actually, since @save is only accessed on an end tag, a begin tag must already have pushed a value onto it. For the first table, the values 0,0 are pushed onto @save and then popped when that table ends. So I don't think I need to worry about dereferencing an undefined value.
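The push/pop argument above can be checked in isolation. The sketch below assumes the same @save layout as the original script and adds one hypothetical case that is not in the thread: malformed HTML with a stray </table>, which appears to be the only situation where the pop would find @save empty.

```perl
use strict;
use warnings;

my @save;

# Balanced case: every pop is preceded by a push, as argued above.
push @save, [ 0, 0 ];
my ($row, $column) = @{ pop @save };
print "balanced: row=$row column=$column\n";

# Unbalanced case (hypothetical stray </table>): @save is empty here,
# so pop returns undef and the dereference dies under "use strict".
my $ok = eval { ($row, $column) = @{ pop @save }; 1 };
my $error = $ok ? '' : $@;
print "stray end tag: ", ($ok ? "no error\n" : "died: $error");
```

So for well-formed input no guard is needed; whether to defend against an unmatched end tag depends on how much the incoming HTML can be trusted.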

      I don't think pushing $tablenr will fix the problem, because when the old value is popped, the next table will think it should do $tablenr++, which will overwrite the previous data. I'll have to do something else.

        You are right about that @save and undefined values thing. My brain must have fallen asleep there for a while.

        Pushing $tablenr will fix the nested-table case, though. You still increase $count each time you encounter a table start, and you also set $tablenr to $count. It's only on table end that you restore $tablenr, but you do _not_ restore $count. Hence, each table is given a unique, increasing number.

        Have you tried it, by the way? I haven't, so I'm only speaking theoretically here...

        regards,

        GoldClaw
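The numbering argument above can be tried out without the parser. This is a minimal sketch, assuming the $tablenr changes exactly as posted; the @events list is a hypothetical stand-in for the HTML::Parser callbacks and describes one outer cell that holds text, a nested table, and then more text.

```perl
use strict;
use warnings;

# Hypothetical event stream standing in for the parser callbacks.
my @events = (
    [ start => 'table' ], [ start => 'tr' ], [ start => 'td' ],
    [ text  => 'outer-before ' ],
    [ start => 'table' ], [ start => 'tr' ], [ start => 'td' ],
    [ text  => 'inner' ],
    [ end   => 'td' ], [ end => 'tr' ], [ end => 'table' ],
    [ text  => 'outer-after' ],
    [ end   => 'td' ], [ end => 'tr' ], [ end => 'table' ],
);

my (@table, @save);
my ($count, $tablenr, $row, $column, $in_table) = (0, 0, 0, 0, 0);

for my $event (@events) {
    my ($type, $value) = @$event;
    if ($type eq 'start') {
        if ($value eq 'table') {
            push @save, [ $tablenr, $row, $column ];
            $row = $column = 0;
            $tablenr = $count;    # as posted: assigned before the increment
            ++$count;
            ++$in_table;
        }
        $row++    if $value eq 'tr';
        $column++ if $value eq 'td';
    }
    elsif ($type eq 'end') {
        if ($value eq 'table') {
            # restore the enclosing table's number and position
            ($tablenr, $row, $column) = @{ pop @save };
            --$in_table;
        }
        $column = 0 if $value eq 'tr';
    }
    else {
        $table[$tablenr][$row][$column] .= $value if $in_table;
    }
}

print "table 0: $table[0][1][1]\n";
print "table 1: $table[1][1][1]\n";
```

In this run the text after the nested table goes back into the outer table's cell, and $count is never rewound, so a later table cannot reuse an earlier number. One side effect of the posted order ($tablenr = $count before ++$count) is that table numbers start at 0 here, while rows and columns still start at 1.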