zzspectrez has asked for the wisdom of the Perl Monks concerning the following question:
Here is my first attempt at using HTML::Parser. Up to now I have been using Perl's pattern-matching abilities to extract the data I need from HTML files. I know this is bad.
So I have been trying to figure out how to get HTML::Parser to work for my needs. I wrote a Perl script that downloads my bank information from my bank's secure site using LWP. Now I want to extract just the account information. The layout of the page uses several layers of nested tables, and I wasn't sure of the best way to handle that. I located HTML::TableExtract on CPAN, which should do what I need. Looking over its docs, though, it seems more useful for situations where the tables have headers, which these do not. I need to be able to get the text from a specific row and column of a specific table, and I don't think this module does that.

So I made the following, which works. I would like suggestions on how I could improve it. Here is a stripped-down version of what I'm doing. To get at the data, do something like $table[12][1][2], which is the text in row 1, column 2 of the 12th table in the HTML file. Indexes start at 1, not 0.
    #!/usr/local/bin/perl -w
    use strict;
    use HTML::Parser;

    my @table;          # $table[table][row][column], indexes start at 1
    my @save;           # saved [row, column] of enclosing tables
    my $count    = 0;   # number of <table> tags seen so far
    my $row      = 0;
    my $column   = 0;
    my $in_table = 0;   # current table nesting depth

    my $p = HTML::Parser->new(
        api_version => 3,
        handlers    => [
            start => [\&_start, "tagname, attr"],
            end   => [\&_end,   "tagname"],
            text  => [\&_text,  "dtext"],
        ],
        marked_sections => 1,
    );
    $p->parse_file('test.html');

    sub _start {
        my ($tag, $attr) = @_;
        if ($tag eq 'table') {
            push @save, [$row, $column];
            $row = $column = 0;
            ++$count;
            $in_table++;
        }
        $row++    if ($tag eq 'tr');
        $column++ if ($tag eq 'td');
    }

    sub _end {
        my ($tag) = @_;
        if ($tag eq 'table') {
            ($row, $column) = @{ pop @save };
            --$in_table;
        }
        $column = 0 if ($tag eq 'tr');
    }

    sub _text {
        my $text = shift;
        chomp $text;
        $text =~ s/\xa0//g;    # some reason data has bunch of \xA0 characters ???
        return unless length $text;
        $table[$count][$row][$column] .= $text
            if ($in_table) && ($text !~ m/^\s+$/);
    }

    ## print data
    print 'ACCOUNT: ',   $table[12][1][2], "\n";
    print 'BALANCE: ',   $table[12][1][3], "\n";
    print 'AVAILABLE: ', $table[12][1][4], "\n";
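One possible refinement, offered only as a sketch: in the script above the table index is a running counter, so as far as I can tell any text that follows a nested table inside the same outer table is filed under the nested table's number. The version below also saves the table index on the stack. The names `$cur`, `$seen`, `@stack`, the handler names, and the sample HTML are all made up for illustration, and `unbroken_text` is switched on so that each run of text should arrive in one piece.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample input: an outer table whose first cell holds a nested table.
my $html = <<'HTML';
<table>
  <tr><td>outer A<table><tr><td>inner 1</td><td>inner 2</td></tr></table></td>
      <td>outer B</td></tr>
</table>
HTML

my @table;       # $table[table][row][column], all indexes start at 1
my @stack;       # saved [table, row, column] of the enclosing tables
my $seen = 0;    # number of <table> tags seen so far
my ($cur, $row, $column) = (0, 0, 0);    # $cur == 0 means "outside any table"

sub start_tag {
    my ($tag) = @_;
    if ($tag eq 'table') {
        push @stack, [$cur, $row, $column];    # remember where we were
        $cur = ++$seen;
        $row = $column = 0;
    }
    elsif ($tag eq 'tr') { $row++; $column = 0 }
    elsif ($tag eq 'td' or $tag eq 'th') { $column++ }
}

sub end_tag {
    my ($tag) = @_;
    # Return to the enclosing table (or to "no table") when a table closes.
    ($cur, $row, $column) = @{ pop @stack } if $tag eq 'table' && @stack;
}

sub text {
    my ($text) = @_;
    return unless $cur;
    $text =~ s/\xa0/ /g;         # dtext decodes &nbsp; to \xA0; treat it as a space
    $text =~ s/^\s+|\s+$//g;     # trim leading and trailing whitespace
    return unless length $text;  # keeps a cell whose text is just "0"
    $table[$cur][$row][$column] .= $text;
}

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [\&start_tag, 'tagname'],
    end_h       => [\&end_tag,   'tagname'],
    text_h      => [\&text,      'dtext'],
);
$p->unbroken_text(1);
$p->parse($html);
$p->eof;

print "table 1, row 1, col 1: $table[1][1][1]\n";
print "table 1, row 1, col 2: $table[1][1][2]\n";
print "table 2, row 1, col 2: $table[2][1][2]\n";
```

With this sample, "outer B" should land in `$table[1][1][2]`, next to "outer A", instead of being appended to the nested table's second cell. Whether that matters depends on the page: if the cells of interest never follow a nested table, the simpler counter gives the same result.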
Thanks!
zzSPECTREz
Replies are listed 'Best First'.

Re: Using HTML::Parser extract text from tables
by OeufMayo (Curate) on Jan 16, 2001 at 14:07 UTC

Re: Using HTML::Parser extract text from tables
by goldclaw (Scribe) on Jan 16, 2001 at 17:44 UTC
by zzspectrez (Hermit) on Jan 18, 2001 at 12:07 UTC
by zzspectrez (Hermit) on Jan 17, 2001 at 06:28 UTC
by Anonymous Monk on Jan 17, 2001 at 15:45 UTC