What does parse_file return? The stripped text? No ... if only it were that easy. Actually, it is that easy with HTML::TokeParser::Simple (just see the first example), but you want to learn this module. I'll give you a hint -- you have to specify a callback subroutine for HTML::Parser so that when it processes a text 'event' it knows what to do with it. If all you want is the text, then this should be enough:

    use HTML::Parser;

    my $p = HTML::Parser->new(
        api_version => 3,
        # call this sub for every text event; "dtext" passes the entity-decoded text
        text_h      => [ sub {print shift}, "dtext" ],
    );
    $p->parse_file('somefile.html') or die "could not parse HTML file\n";

HTML::Parser is not easy. I recommend using HTML::TokeParser or HTML::TokeParser::Simple if you just "want to get it done." Otherwise, you have a lot more reading to do. :) Try a Super Search here at the Monastery for "HTML::Parser" and you might find a lot of useful examples.
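If you would rather collect the text in a variable than print it, the same text handler can append to a string instead. This is only a minimal sketch: the helper name strip_tags and the sample markup are made up for illustration, and it feeds the parser a string with parse/eof rather than reading a file. Note that, as far as I know, the text handler also fires for the contents of script and style elements, so real pages may need extra filtering.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (not from the original post): returns the text
# content of an HTML string with the tags removed.
sub strip_tags {
    my ($html) = @_;
    my $text = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # Append each chunk of entity-decoded text to $text.
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );

    $p->parse($html);   # feed the document (may be called repeatedly)
    $p->eof;            # signal end of input so buffered text is flushed

    return $text;
}

print strip_tags('<p>Hello, <b>world</b> &amp; friends</p>'), "\n";
```

The closure captures $text, so the parser object itself needs no extra state; calling eof matters because the parser may hold back trailing text until it knows no more input is coming.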
Update: I just did a quick Super Search on my nodes; perhaps "(jeffa) Re: Regexp to ignore HTML tags" will be of use to you.
jeffa
L-LL-L--L-LL-L--L-LL-L-- -R--R-RR-R--R-RR-R--R-RR B--B--B--B--B--B--B--B-- H---H---H---H---H---H--- (the triplet paradiddle with high-hat)
In reply to Re: Using HTML::Parser for simple tag removal
by jeffa
in thread Using HTML::Parser for simple tag removal
by bradcathey