I'm reposting my response, since the question was originally posted to Perl Monks Discussion.
I'm not a seasoned user of HTML::Parser, but
I believe it calls a function for each opening and closing tag
it encounters, and for each piece of text between tags. If
that's the case, you can set a flag when you encounter
a certain opening tag, then append all the text to a variable
until you encounter the corresponding closing tag, at which
point you can store the text wherever you want. Using the
HTML::Parser version 2 subclassing interface, something like this
(untested code, based on sample code from the HTML::Parser
documentation):
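{
    package MyParser;
    use strict;
    use warnings;
    use base 'HTML::Parser';

    # Per-tag state: whether we are inside the tag, and the text seen so far.
    our (%capturing, %text);

    # Called for each opening tag.
    sub start {
        my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
        if ($tagname eq 'blockquote') {
            $capturing{blockquote} = 1;
            $text{blockquote}      = "";
        }
    }

    # Called for each closing tag.
    sub end {
        my ($self, $tagname, $origtext) = @_;
        if ($tagname eq 'blockquote') {
            $capturing{blockquote} = 0;
            # Do whatever you want to do with $text{blockquote}
        }
    }

    # Called for each piece of text between tags.
    sub text {
        my ($self, $origtext, $is_cdata) = @_;
        $text{blockquote} .= $origtext if $capturing{blockquote};
    }
}
my $p = MyParser->new;
$p->parse_file("foo.html") or die "Can't open foo.html: $!";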
This will capture all the text between BLOCKQUOTE tags. Of
course, you can use more complex rules for capturing what you
want and storing it where you want it, but the general idea
should be the same.
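For what it's worth, here is a rough sketch of the same flag-and-buffer idea generalized to several tags, this time written against the version 3 handler interface (api_version => 3 with start_h, end_h and text_h) rather than subclassing. The helper name capture_text and its return shape are made up for illustration, and I have only tried it on small inputs, so treat it as a starting point.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (not part of HTML::Parser): returns a hash ref mapping
# each requested tag name to a list of the text blocks found inside it.
sub capture_text {
    my ($html, @tags) = @_;
    my %want = map { $_ => 1 } @tags;
    my (%depth, %buffer, %captured);

    my $p = HTML::Parser->new(
        api_version => 3,
        # Opening tag: start a fresh buffer unless we are already inside one.
        start_h => [ sub {
            my ($tag) = @_;
            return unless $want{$tag};
            $buffer{$tag} = "" unless $depth{$tag}++;
        }, "tagname" ],
        # Closing tag: store the buffer once the outermost tag is closed.
        end_h => [ sub {
            my ($tag) = @_;
            return unless $want{$tag} && $depth{$tag};
            push @{ $captured{$tag} }, $buffer{$tag} if --$depth{$tag} == 0;
        }, "tagname" ],
        # Text: append to the buffer of every tag we are currently inside.
        text_h => [ sub {
            my ($text) = @_;
            $buffer{$_} .= $text for grep { $depth{$_} } keys %depth;
        }, "dtext" ],
    );
    $p->parse($html);
    $p->eof;
    return \%captured;
}
```

Calling capture_text($html, 'blockquote', 'p') would then give you $result->{blockquote} and $result->{p} as array refs, one entry per block.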
--ZZamboni
Why can you only read one line at a time? Line breaks carry no special meaning in HTML (\n is just whitespace in an HTML file, except in <pre> blocks).
Why not read the whole file first and parse it that way?
It wouldn't take much then to find the plain text.
Easy ways to read the whole file include:
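@file = <FILEHANDLE>;        # list context: one line per array element
# or:
{
    local $/ = undef;        # no input record separator ("slurp" mode)
    $file = <FILEHANDLE>;    # scalar context: the whole file as one string
}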
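If you'd rather not use a bareword filehandle, the second form can be wrapped up roughly like this. The name slurp_file is just something I picked for the example; the only real machinery is the three-argument open and localizing $/.

```perl
use strict;
use warnings;

# Hypothetical helper: read an entire file into one string.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;                # undefined record separator: read everything at once
    my $content = <$fh>;
    close $fh;
    return $content;
}
```

Because $/ is localized inside the sub, line-at-a-time reads elsewhere in the program keep working as before.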
The reason is that I need to parse out the HTML tags in the file and print them to the browser. The way I am doing this is as follows: take in a line, see if there is a starting tag (like <body> or <html>) and send it to $start, check if there is an ending tag (like </table>, </html> etc.) and send that to $end, and finally see if there is plain text and send that to $text. This works just peachy for printing just the HTML tags to the browser, but now I am left with several $text values and I don't know how to keep blocks of text together.
Shaheeb
I didn't ask why you need to parse the HTML. I asked why you can't read the whole file. But I see what the issue is now: you are worried about font tags and line breaks breaking up your plain text. My advice is either to forget all the breaks and join ' ', @lines, or to use a state machine to keep track of where you are as you parse.
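To make the state machine suggestion a bit more concrete, here is a minimal sketch. It assumes the whole file is already in one string, that a short list of tags (the %BLOCK hash, which is my guess at what counts as a block boundary) ends a block of text, and that anything else such as <font> leaves you in the same block. The regex tokenizer is naive and will trip over a > inside an attribute value, so a real parser is the safer choice for messy input.

```perl
use strict;
use warnings;

# Assumption: these tags separate blocks of text; all other tags are inline.
my %BLOCK = map { $_ => 1 } qw(html body p div table tr td blockquote h1 h2 h3);

# Hypothetical helper: return the plain text of $html as a list of blocks.
sub text_blocks {
    my ($html) = @_;
    my @blocks;
    my $current = '';    # state: text collected for the block we are in

    # The capturing group makes split return the tags as tokens too.
    for my $token (split /(<[^>]*>)/, $html) {
        if ($token =~ /^<[^>]*>\z/) {
            my ($name) = $token =~ m{^</?([A-Za-z][A-Za-z0-9]*)};
            $name = defined $name ? lc $name : '';
            if ($BLOCK{$name}) {
                # Block boundary: finish the current block and start a new one.
                push @blocks, $current if $current =~ /\S/;
                $current = '';
            }
            elsif ($name eq 'br') {
                $current .= ' ';    # keep words on either side of <br> apart
            }
            # Any other tag (font, b, comments, ...) leaves the state unchanged.
        }
        else {
            $current .= $token;
        }
    }
    push @blocks, $current if $current =~ /\S/;

    # Newlines are just whitespace in HTML, so collapse runs to one space.
    for (@blocks) { s/\s+/ /g; s/^ //; s/ $//; }
    return @blocks;
}
```

The single piece of state here is $current. If you also need to know which tag a block came from, you could keep a stack of open tag names alongside it.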