in reply to Using HTML::Parser for simple tag removal

What does parse_file return? The stripped text? No ... if only it were that easy. Actually, it is that easy with HTML::TokeParser::Simple (just see the first example), but you want to learn this module. I'll give you a hint -- you have to specify a callback subroutine for HTML::Parser so that when it processes a text 'event' it knows what to do with it. If all you want is the text, then this should be enough:

use HTML::Parser;

my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { print shift }, "dtext" ],
);
$p->parse_file('somefile.html') || die "could not parse HTML file\n";

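If you would rather have the stripped text in a variable than printed to STDOUT, the same text handler can append to a string instead. This is only a rough sketch: strip_tags is a name I made up for illustration, and it parses an in-memory string with parse/eof rather than reading a file.

```perl
use strict;
use warnings;
use HTML::Parser;

# strip_tags is a hypothetical helper, not something HTML::Parser provides.
# It returns the concatenated, entity-decoded text of an HTML string.
sub strip_tags {
    my ($html) = @_;
    my $text = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # "dtext" hands the callback the text with entities already decoded
        text_h      => [ sub { $text .= shift }, "dtext" ],
    );

    $p->parse($html);   # feed the whole document in one chunk
    $p->eof;            # tell the parser there is no more input

    return $text;
}

print strip_tags('<p>Hello <b>world</b></p>'), "\n";
```

The callback closes over $text, so each text event just adds to it, and nothing is printed until you decide what to do with the result.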
HTML::Parser is not easy. I recommend using HTML::TokeParser or HTML::TokeParser::Simple if you just "want to get it done"; otherwise, you have a lot more reading to do. :) Try a Super Search here at the Monastery for "HTML::Parser" and you might find a lot of useful examples.
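For comparison, here is roughly what the same job looks like with HTML::TokeParser, which lets you pull tokens in a loop instead of registering callbacks. Treat it as a sketch under a few assumptions: strip_tags_toke is a made-up name, the input is a string passed by reference, and the raw token text is run through HTML::Entities::decode_entities because get_token (as far as I recall) does not decode it for you.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# strip_tags_toke is a hypothetical helper name used only for this example.
sub strip_tags_toke {
    my ($html) = @_;

    # Passing a reference to a scalar parses that string, not a file
    my $p = HTML::TokeParser->new(\$html)
        or die "could not create parser\n";

    my $text = '';
    while ( my $token = $p->get_token ) {
        # Text tokens look like ["T", $text, $is_data]; skip everything else
        next unless $token->[0] eq 'T';
        # Decode entities unless the text is literal data (script/style content)
        $text .= $token->[2] ? $token->[1] : decode_entities( $token->[1] );
    }
    return $text;
}

print strip_tags_toke('<p>Hello <b>world</b></p>'), "\n";
```

The pull style tends to be easier to follow when you only care about one kind of token, since the control flow stays in your own loop.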

Update: I just did a quick Super Search on my nodes; perhaps "(jeffa) Re: Regexp to ignore HTML tags" will be of use to you.

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)

Re^2: Using HTML::Parser for simple tag removal
by bradcathey (Prior) on May 25, 2005 at 20:53 UTC

    Thanks jeffa, worked first time. I will definitely look at HTML::TokeParser::Simple as an alternative.

    Since I came here almost 2 years ago, the mantra I've heard is "use modules," but, as you hinted, I'm finding some modules much easier to use than others, and the docs are usually not written for graphic designers ;-)


    —Brad
    "The important work of moving the world forward does not wait to be done by perfect men." George Eliot