I've been having a hard time figuring out Perl modules, and have only been trying some simple Perl code after a decade of no programming at all.

So, I took the snippet of code listed as the 3rd example in the HTML::Parser documentation, changing only title to table:
use HTML::Parser ();

sub start_handler {
    return if shift ne "table";
    my $self = shift;
    $self->handler(text => sub { print shift }, "dtext");
    $self->handler(end  => sub { shift->eof if shift eq "table"; },
                   "tagname,self");
}

my $p = HTML::Parser->new(api_version => 3);
$p->handler(start => \&start_handler, "tagname,self");
$p->parse_file(shift || die) || die $!;
print "\n";
Now, my boss (who has a bit more practical experience with Perl) and I have been trying different things to brute-force the data extraction, but we usually wound up with a ton of tags and other XML garbage printing out.

Running the code above on an example saved from here, everything comes out fine, except that a lot of the paragraph and TR tags have nbsp's in them, which show up as accented A's under ActivePerl.

So far, neither of us has been able to remove/skip the nbsp's, and/or ignore them so they are not counted as part of the output.
The whole point, as I understand it, is to eventually dump this data into an Oracle DB, if we can get past this current bump.
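
For the nbsp problem specifically, one possible approach is to clean each chunk of text inside the text handler before it is printed or stored. The following is only a minimal sketch, not a tested fix for the page in question: clean_text and table_text are hypothetical helper names (not part of HTML::Parser), and it assumes the nbsp's arrive either as the single character 0xA0 (what "dtext" produces from &nbsp;) or as the raw UTF-8 byte pair 0xC2 0xA0, which would explain the accented A.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: turn non-breaking spaces into ordinary spaces.
sub clean_text {
    my ($text) = @_;
    # Assumption: nbsp stored as raw UTF-8 bytes in the saved file.
    # (This would also eat a genuine "A-circumflex + nbsp" pair.)
    $text =~ s/\xC2\xA0/ /g;
    # nbsp as a single character, e.g. decoded from &nbsp; by "dtext".
    $text =~ s/\xA0/ /g;
    return $text;
}

# Hypothetical helper: collect the cleaned text found inside any <table>.
sub table_text {
    my ($html) = @_;
    my $out      = '';
    my $in_table = 0;    # nesting depth of <table> tags seen so far

    my $p = HTML::Parser->new(api_version => 3);
    $p->handler(start => sub { $in_table++ if $_[0] eq 'table' }, 'tagname');
    $p->handler(end   => sub { $in_table-- if $_[0] eq 'table' && $in_table > 0 },
                'tagname');
    $p->handler(text  => sub { $out .= clean_text($_[0]) if $in_table }, 'dtext');

    $p->parse($html);
    $p->eof;
    return $out;
}

print table_text('<table><tr><td>a&nbsp;b</td></tr></table>'), "\n";
```

To read a saved file instead of a string, the same handlers should work with parse_file, as in the snippet above.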

And it seems that among the Parser, Extractor, TableExtract there is a bit of everything we need, but I'll be darned if I can figure out what and where it goes after 2 weeks of reading.
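
If the end goal is rows of cells rather than a stream of text, HTML::TableExtract may be the closer fit of the three. Here is a rough sketch, assuming the module is installed and that every table in the page is wanted (no headers, depth, or count constraints); table_rows is a hypothetical helper name, and the sample HTML is made up for illustration.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical helper: return every row of every table as an array ref
# of cleaned cell strings.
sub table_rows {
    my ($html) = @_;
    my $te = HTML::TableExtract->new();   # no constraints: match all tables
    $te->parse($html);
    $te->eof;

    my @rows;
    for my $ts ($te->tables) {
        for my $row ($ts->rows) {
            push @rows, [
                map {
                    my $cell = defined $_ ? $_ : '';   # empty cells can be undef
                    $cell =~ s/\xA0/ /g;               # decoded &nbsp; -> space
                    $cell =~ s/^\s+|\s+$//g;           # trim surrounding whitespace
                    $cell;
                } @$row
            ];
        }
    }
    return @rows;
}

# Made-up sample input; each row is printed with | between cells.
my $sample = '<table><tr><td>a&nbsp;b</td><td> c </td></tr></table>';
print join('|', @$_), "\n" for table_rows($sample);
```

Each row comes back as a plain list of strings, which is roughly the shape needed for a later database insert.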

If anyone cares to play "Help The Idjit", many thanks.
Adding comments to the above code, if you would be so kind, would help me understand WTH is going on (i.e., talk to me like a bright 5-year-old {grin}).

In reply to Should I use; Html Parser, table extract, Extractor by a_non_moose
