
Here is my first attempt at using HTML::Parser. Up to now I have been using Perl's pattern matching to extract the data I need from HTML files. I know this is bad.

So I have been trying to figure out how to get HTML::Parser to work for my needs. I wrote a Perl script that downloads my account information from my bank's secure site using LWP. Now I want to extract just the account information. The page layout uses multiple layers of nested tables, and I wasn't sure of the best way to handle that.

I found HTML::TableExtract on CPAN, which should do what I need. Looking over its docs, though, it seems more useful when the tables have headers, and these have none. I need to be able to get the text from a specific row and column of a specific table, and I don't think this module does that.

So I wrote the following, which works. I would like suggestions on how I could improve it. Here is a stripped-down version of what I'm doing. To get at the data, use something like $table[12][1][2], which is the text in row 1, column 2 of the 12th table in the HTML file. Indexes start at 1, not 0.

#!/usr/local/bin/perl -w
use strict;
use HTML::Parser;

my  @table;
my  @save;
my  $count    = 0;
my  $row      = 0;
my  $column   = 0;
my  $in_table = 0;

my  $p = HTML::Parser->new( api_version     => 3,
                            handlers        => [ start => [\&_start, "tagname, attr"],
                                                 end   => [\&_end,   "tagname"],
                                                 text  => [\&_text,  "dtext"],
                                               ],
                            marked_sections => 1,
                          );

$p->parse_file('test.html');

sub _start {
  my ($tag, $attr) = @_;   # shift would only fill $tag and leave $attr undefined
  if ($tag eq 'table'){
    push @save, [$row,$column];
    $row = $column = 0;
    ++$count;
    $in_table++;
  }
  $row++    if ($tag eq 'tr');
  $column++ if ($tag eq 'td');  
}

sub _end {
  my ($tag) = @_;          # the end handler is only passed "tagname"
  if ($tag eq 'table') {
    ($row, $column) = @{ pop @save };
    --$in_table;
  }
  $column = 0 if ($tag eq 'tr');
}

sub _text {
  my $text = shift;
  chomp $text;
  $text =~ s/\xa0//g; # "dtext" decodes each &nbsp; entity to \xA0 (non-breaking space); strip them all
  return unless $in_table && $text =~ /\S/; # skip text outside tables and whitespace-only text
  $table[$count][$row][$column] .= $text;
}

## print data
print 'ACCOUNT:   ',$table[12][1][2], "\n";
print 'BALANCE:   ',$table[12][1][3], "\n";
print 'AVAILABLE: ',$table[12][1][4], "\n";
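
Finding the right indexes (12, 1, 2 and so on) is the awkward part, so here is a minimal sketch of how every non-empty cell could be listed together with its index. The helper name cell_listing and the sample data are made up for illustration; it only assumes the same [table][row][column] layout with 1-based indexes that the script above builds, and it ignores anything stored at index 0.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Parser): returns one line per
# non-empty cell of a [table][row][column] structure. Index 0 is skipped
# at every level because the indexes are 1-based.
sub cell_listing {
    my ($tables) = @_;
    my @lines;
    for my $t (1 .. $#{$tables}) {
        my $rows = $tables->[$t] or next;      # table held no text at all
        for my $r (1 .. $#{$rows}) {
            my $cols = $rows->[$r] or next;    # row held no text
            for my $c (1 .. $#{$cols}) {
                my $cell = $cols->[$c];
                next unless defined $cell;     # empty cell
                push @lines, "\$table[$t][$r][$c] = $cell";
            }
        }
    }
    return @lines;
}

# Made-up sample data standing in for a parsed page.
my @sample;
$sample[1][1][1] = 'Checking';
$sample[1][1][2] = '123.45';
$sample[2][1][1] = 'Savings';

print "$_\n" for cell_listing(\@sample);
```

In the real script the call would be cell_listing(\@table) after parse_file has run.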

Thanks!
zzSPECTREz

In reply to Using HTML::Parser extract text from tables by zzspectrez
