Here is my first attempt at using HTML::Parser. Up to now I have been using Perl's pattern matching abilities to extract the data I need from HTML files. I know this is bad.

So I have been trying to figure out how to get HTML::Parser to work for my needs. I wrote a Perl script that downloads my bank information from my bank's secure site using LWP. Now I want to extract just the account information. The layout of the page makes use of multiple layers of nested tables, and I wasn't sure of the best way to handle this. I located HTML::TableExtract on CPAN, which should do what I need, but looking over its docs it seems more useful for situations where the tables have headers, which these do not. I need to be able to get the text from a specific row and column of a specific table, and I don't think this module does that.

So I made the following, which works. I would like suggestions on how I could improve it. Here is a stripped-down version of what I'm doing. To get at the data, use something like $table[12][1][2], which is the text in row 1, column 2 of the 12th table in the HTML file. Indexes start at 1, not 0.

#!/usr/local/bin/perl -w
use strict;
use HTML::Parser;

my @table;            # $table[table][row][column], all indexes start at 1
my @save;             # row/column of the enclosing tables while nested
my $count    = 0;     # number of the most recently opened table
my $row      = 0;
my $column   = 0;
my $in_table = 0;     # current nesting depth

my $p = HTML::Parser->new(
    api_version => 3,
    handlers    => [
        start => [\&_start, "tagname, attr"],
        end   => [\&_end,   "tagname"],
        text  => [\&_text,  "dtext"],
    ],
    marked_sections => 1,
);
$p->parse_file('test.html');

sub _start {
    my ($tag, $attr) = @_;    # was "= shift", which left $attr undefined
    if ($tag eq 'table') {
        push @save, [$row, $column];
        $row = $column = 0;
        ++$count;
        $in_table++;
    }
    $row++    if ($tag eq 'tr');
    $column++ if ($tag eq 'td');
}

sub _end {
    my ($tag) = @_;           # the end handler only receives the tag name
    if ($tag eq 'table') {
        ($row, $column) = @{ pop @save };
        --$in_table;
    }
    $column = 0 if ($tag eq 'tr');
}

sub _text {
    my $text = shift;
    chomp $text;
    $text =~ s/\xa0//g;       # for some reason the data has a bunch of \xA0 characters (&nbsp;?)
    return unless length $text;
    $table[$count][$row][$column] .= $text
        if ($in_table) && ($text !~ m/^\s+$/);
}

## print data
print 'ACCOUNT: ',   $table[12][1][2], "\n";
print 'BALANCE: ',   $table[12][1][3], "\n";
print 'AVAILABLE: ', $table[12][1][4], "\n";
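
To make the indexing concrete, here is a minimal, self-contained sketch of the same approach run against a made-up page with one nested table. The sample HTML, the handler names (on_start, on_end, on_text) and the $current variable are assumptions for illustration only, not part of the script above. One deliberate difference: it also saves the current table number on the stack, so text that follows a nested table is filed under the enclosing table again. In the script above $count is never restored after a closing table tag, so such text would end up under the inner table's number.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample page: an outer table whose second cell holds a nested table.
my $html = <<'HTML';
<table>
  <tr><td>outer A</td><td>
    <table>
      <tr><td>inner A</td><td>inner B</td></tr>
    </table>
  </td></tr>
  <tr><td>outer C</td><td>outer D</td></tr>
</table>
HTML

my @table;                  # $table[table][row][column], indexes start at 1
my @save;                   # [table, row, column] of each enclosing table
my $count   = 0;            # how many tables have been opened so far
my $current = 0;            # number of the table we are currently inside
my ($row, $column) = (0, 0);

my $p = HTML::Parser->new(
    api_version => 3,
    handlers    => [
        start => [\&on_start, 'tagname'],
        end   => [\&on_end,   'tagname'],
        text  => [\&on_text,  'dtext'],
    ],
);
$p->parse($html);
$p->eof;

sub on_start {
    my ($tag) = @_;
    if ($tag eq 'table') {
        # Remember where we were in the enclosing table (if any).
        push @save, [$current, $row, $column];
        $current = ++$count;
        $row = $column = 0;
    }
    $row++    if $tag eq 'tr';
    $column++ if $tag eq 'td';
}

sub on_end {
    my ($tag) = @_;
    if ($tag eq 'table' && @save) {
        # Back to the enclosing table, including its table number.
        ($current, $row, $column) = @{ pop @save };
    }
    $column = 0 if $tag eq 'tr';
}

sub on_text {
    my ($text) = @_;
    # Ignore text outside any table and whitespace-only chunks.
    return unless @save && $text =~ /\S/;
    $text =~ s/^\s+|\s+$//g;
    $table[$current][$row][$column] .= $text;
}

print "table 1, row 1, col 1: $table[1][1][1]\n";   # outer A
print "table 2, row 1, col 2: $table[2][1][2]\n";   # inner B
print "table 1, row 2, col 1: $table[1][2][1]\n";   # outer C
```

Tables are numbered in the order their opening tags appear, so the nested table is table 2 even though it sits inside table 1.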

Thanks!
zzSPECTREz


In reply to Using HTML::Parser extract text from tables by zzspectrez
