By "blocks of text" I assumed you meant the pieces between
<P> and <br> tags. I'm still not sure what you mean
by titles, though...
If your HTML uses tables, it would make this a lot easier.
Good luck.
Russ
Brainbench 'Most Valuable Professional' for Perl
First, this should have been posted to Seekers of Perl Wisdom, not here.
Second, I'm not a seasoned user of HTML::Parser, but
I believe it calls a function for each opening and closing tag
it encounters, and for each piece of text between tags. If
that's the case, you can set special flags when you encounter
certain opening tags, and then store all the text in a variable
until you encounter the corresponding closing tag, at which
point you can store the text wherever you want. Using the
HTML::Parser version 2 subclassing interface, something like this
(untested code, based on the sample code in the HTML::Parser
documentation):
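{
    package MyParser;
    use strict;
    use warnings;
    use base 'HTML::Parser';

    # Capture state, shared by the three callbacks below.
    my (%capturing, %text);

    # Called for each opening tag.
    sub start {
        my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
        if ($tagname eq 'blockquote') {
            $capturing{blockquote} = 1;
            $text{blockquote}      = "";
        }
    }

    # Called for each closing tag.
    sub end {
        my ($self, $tagname, $origtext) = @_;
        if ($tagname eq 'blockquote') {
            $capturing{blockquote} = 0;
            # Do whatever you want to do with $text{blockquote}
        }
    }

    # Called for each piece of text between tags.
    sub text {
        my ($self, $origtext, $is_cdata) = @_;
        $text{blockquote} .= $origtext if $capturing{blockquote};
    }
}

my $p = MyParser->new;
$p->parse_file("foo.html") or die "Cannot read foo.html: $!";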
This will capture all the text between BLOCKQUOTE tags. Of
course, you can write more complex rules for capturing what you
want and storing it where you want it, but the general idea
should be the same.
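
For comparison, here is a rough sketch of the same flag-and-capture idea using the handler interface of HTML::Parser version 3 (no subclass needed), parsing from a string instead of a file. The set of tags in %want, the variable names, and the sample HTML are all made up for illustration; treat this as one possible way to do it, not the canonical one.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical choice of tags to capture; adjust to taste.
my %want = map { $_ => 1 } qw(blockquote h1);

my @captured;    # finished captures: { tag => ..., text => ... }
my $current;     # the capture in progress, if any

my $p = HTML::Parser->new(
    api_version => 3,

    # Opening tag: start capturing if it is a wanted tag and we are idle.
    start_h => [ sub {
        my ($tag) = @_;
        $current = { tag => $tag, text => '' } if $want{$tag} && !$current;
    }, 'tagname' ],

    # Text between tags: append it to the capture in progress.
    text_h => [ sub {
        my ($text) = @_;
        $current->{text} .= $text if $current;
    }, 'dtext' ],

    # Closing tag: finish the capture if it closes the tag that opened it.
    end_h => [ sub {
        my ($tag) = @_;
        return unless $current && $current->{tag} eq $tag;
        push @captured, $current;
        undef $current;
    }, 'tagname' ],
);

# Made-up sample input.
$p->parse('<h1>Title</h1><p>skip</p><blockquote>Some <b>quoted</b> text</blockquote>');
$p->eof;

print "$_->{tag}: $_->{text}\n" for @captured;
```

Note that this sketch stops capturing at the first matching closing tag, so a BLOCKQUOTE nested inside another BLOCKQUOTE would end the capture early. A depth counter in place of the simple flag would handle that case.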
--ZZamboni
$MULTILINE_MATCHING
$*
Set to 1 to do multi-line matching within a string, 0 to tell Perl that it can assume that strings contain a single line, for the purpose of optimizing pattern matches. Pattern matches on strings containing multiple newlines can produce confusing results when ``$*'' is 0. Default is 0. (Mnemonic: * matches multiple things.) Note that this variable influences the interpretation of only ``^'' and ``$''. A literal newline can be searched for even when $* == 0.
$.
The current input line number for the last file handle from which you read (or performed a seek or tell on). An explicit close on a filehandle resets the line number. Because ``<>'' never does an explicit close, line numbers increase across ARGV files (but see examples under eof()). Localizing $. has the effect of also localizing Perl's notion of ``the last read filehandle''. (Mnemonic: many programs use ``.'' to mean the current line number.)
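
A small sketch of both variables in practice. One caveat about the excerpt above: $* was deprecated and its magic was removed in Perl 5.10, so the example uses the /m modifier, which gives the same multi-line behaviour for ``^'' and ``$'' on a per-pattern basis. The sample string and variable names are invented for illustration.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\nthird line\n";

# Without /m, ^ matches only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;

# With /m, ^ also matches right after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

# $. holds the line number of the most recently read filehandle.
# Here the "file" is an in-memory filehandle opened on $text.
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @numbered;
while (my $line = <$fh>) {
    chomp $line;
    push @numbered, sprintf("%d: %s", $., $line);
}
close $fh;    # an explicit close resets $.

print "plain: @plain\n";
print "multi: @multi\n";
print "$_\n" for @numbered;
```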
______________________________________________
|_____¸.·ooO--(> cRaZy is co01. <)--Ooo·.¸_____|
ŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻ