Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello wise ones. I am working on parsing out all the plain text in a given web page and inserting those pieces into editable text boxes. Here is my problem: I am using the HTML::Parser module, and I can only deal with the file line by line. I really want to keep blocks of text together rather than break each line of text into its own box, and I also want titles to have their own box. I am busting my head on this logic. I don't think I will have any problem with the actual programming, but I can't come up with the right rules for parsing out the file. Please help me! Shaheeb R.

Replies are listed 'Best First'.
RE: I need help with some logic
by Russ (Deacon) on Jul 11, 2000 at 06:21 UTC
    By "blocks of text," I assume you mean the pieces between <P> and <br> tags. I'm still not sure what you mean by titles, though...

    If your HTML uses tables, that would make this a lot easier.

    Good luck.

    Russ
    Brainbench 'Most Valuable Professional' for Perl

RE: I need help with some logic
by ZZamboni (Curate) on Jul 11, 2000 at 18:07 UTC
    First, this should have been posted to Seekers of Perl Wisdom, not here.

    Second, I'm not a seasoned user of HTML::Parser, but I believe it calls a function for each opening and closing tag it encounters, and for each piece of text between tags. If that's the case, you can set a flag when you encounter certain opening tags, then accumulate all the text in a variable until you encounter the corresponding closing tag, at which point you can store the text wherever you want. With HTML::Parser's version 2 subclassing interface, it would look something like this (untested code, based on sample code from the HTML::Parser documentation):

    {
        package MyParser;
        use base 'HTML::Parser';

        # State shared by the callbacks below.
        my (%capturing, %text);

        # Called for each opening tag.
        sub start {
            my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
            if ($tagname eq 'blockquote') {
                $capturing{blockquote} = 1;
                $text{blockquote}      = "";
            }
        }

        # Called for each closing tag.
        sub end {
            my ($self, $tagname, $origtext) = @_;
            if ($tagname eq 'blockquote') {
                $capturing{blockquote} = 0;
                # Do whatever you want to do with $text{blockquote}
            }
        }

        # Called for each piece of text between tags.
        sub text {
            my ($self, $origtext, $is_cdata) = @_;
            $text{blockquote} .= $origtext if $capturing{blockquote};
        }
    }

    my $p = MyParser->new;
    $p->parse_file("foo.html");

    This will capture all the text between BLOCKQUOTE tags. Of course, you can do more complex rules for capturing what you want and storing it where you want it, but the general idea should be the same.
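
    As a rough sketch of what such "more complex rules" might look like for the original question (titles in their own box, other text grouped into blocks), here is one possibility using the version 3 handler interface of HTML::Parser rather than subclassing. The names extract_boxes, %is_title and %is_break are made up for illustration, and the choice of which tags start a title or end a block is purely an assumption; it follows the earlier suggestion of breaking at <P> and <br>.

```perl
use strict;
use warnings;
use HTML::Parser;

# Assumed rule set (adjust to taste): these tags get a box of their own...
my %is_title = map { $_ => 1 } qw(title h1 h2 h3 h4 h5 h6);
# ...and these tags close whatever block is currently being collected.
my %is_break = map { $_ => 1 } qw(p br div blockquote li tr hr);

# Hypothetical helper: returns a list of { type => 'title'|'text', text => ... }
# hashes, one per text box, in document order.
sub extract_boxes {
    my ($html) = @_;
    my @boxes;
    my $buffer   = '';
    my $in_title = 0;

    # Store the collected text as one box, unless it is only whitespace.
    my $flush = sub {
        (my $clean = $buffer) =~ s/\s+/ /g;
        $clean =~ s/^ | $//g;
        push @boxes, { type => ($in_title ? 'title' : 'text'), text => $clean }
            if length $clean;
        $buffer = '';
    };

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Opening tag: a title or break tag ends the block collected so far.
        start_h => [ sub {
            my ($tag) = @_;
            if    ($is_title{$tag}) { $flush->(); $in_title = 1; }
            elsif ($is_break{$tag}) { $flush->(); }
        }, 'tagname' ],
        # Closing tag: same idea, and a closing title tag ends title mode.
        end_h => [ sub {
            my ($tag) = @_;
            if    ($is_title{$tag}) { $flush->(); $in_title = 0; }
            elsif ($is_break{$tag}) { $flush->(); }
        }, 'tagname' ],
        # Text between tags is appended to the current block.
        text_h => [ sub { $buffer .= $_[0] }, 'dtext' ],
    );
    $parser->ignore_elements(qw(script style));   # skip non-visible text

    $parser->parse($html);
    $parser->eof;
    $flush->();    # keep any text left after the last tag
    return @boxes;
}

my @boxes = extract_boxes(
    '<html><head><title>My Page</title></head><body><h1>Intro</h1>'
  . '<p>First line<br>second line</p><p>Another <b>bold</b> block</p>'
  . '</body></html>'
);
print "[$_->{type}] $_->{text}\n" for @boxes;
```

    Inline tags such as <b> are in neither list, so their text stays inside the surrounding block. Taking br out of %is_break would keep a paragraph with line breaks together in a single box.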

    --ZZamboni

(crazyinsomniac) RE: I need help with some logic
by crazyinsomniac (Prior) on Jul 11, 2000 at 12:21 UTC
    $MULTILINE_MATCHING $*
    Set to 1 to do multi-line matching within a string, 0 to tell Perl that it can assume that strings contain a single line, for the purpose of optimizing pattern matches. Pattern matches on strings containing multiple newlines can produce confusing results when "$*" is 0. Default is 0. (Mnemonic: * matches multiple things.) Note that this variable influences the interpretation of only "^" and "$". A literal newline can be searched for even when $* == 0.
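
    For anyone trying this on a current Perl: $* was removed in Perl 5.10.0, and the per-pattern /m modifier is the way to get the same effect on "^" and "$". A minimal sketch of the difference, with a made-up sample string:

```perl
use strict;
use warnings;

# Sample data invented for this illustration.
my $text = "first line\nsecond line\nthird line";

# Without /m, ^ matches only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;

# With /m, ^ also matches right after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

print "without /m: @plain\n";
print "with /m:    @multi\n";
```

    Because /m is attached to a single pattern, it does not change how any other regex in the program behaves, unlike the global $*.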
    $.
    The current input line number for the last file handle from which you read (or performed a seek or tell on). An explicit close on a filehandle resets the line number. Because "<>" never does an explicit close, line numbers increase across ARGV files (but see examples under eof()). Localizing $. has the effect of also localizing Perl's notion of "the last read filehandle". (Mnemonic: many programs use "." to mean the current line number.)
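
    A small sketch of $. in use, assuming an in-memory filehandle with invented contents so that it runs without any input file:

```perl
use strict;
use warnings;

# Invented sample input, read through an in-memory filehandle.
my $data = "alpha\nbeta\ngamma\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";

my @numbered;
while (my $line = <$fh>) {
    chomp $line;
    # $. holds the number of the line just read from $fh.
    push @numbered, sprintf("%d: %s", $., $line);
}
close $fh;    # an explicit close resets $.

print "$_\n" for @numbered;
```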
     ______________________________________________
    |_____¸.·ooO--(> cRaZy is co01. <)--Ooo·.¸_____|
     ŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻŻ