shaba has asked for the wisdom of the Perl Monks concerning the following question:

Hello wise ones. I am working on parsing out all the plain text in a given web page and inserting those pieces into editable text boxes. Here is my problem: I am using the HTML::Parser module and I can only deal with the file line by line. I really want to keep blocks of text together rather than break up each line of text into its own box. I also want titles to have their own box. I am busting my head on this logic. I don't think I will have any problem with the actual programming, but I can't come up with the right rules for parsing out the file. Please help me!! Shaheeb R.

(ZZamboni) Re: I need help with some logic
by ZZamboni (Curate) on Jul 11, 2000 at 22:13 UTC
    I'm reposting my response, since the question was originally posted to Perl Monks Discussion.

    I'm not a seasoned user of HTML::Parser, but I believe it calls a function for each opening and closing tag it encounters, and for each piece of text between tags. If that's the case, you can set special flags when you encounter certain opening tags, and then store all the text in a variable until you encounter the corresponding closing tag, at which point you can store the text wherever you want. Using the HTML::Parser version 2 subclassing, something like this: (untested code, based on sample code from the HTML::Parser documentation)

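    use strict;
    use warnings;

    {
        package MyParser;
        use base 'HTML::Parser';

        # Package globals: capture flag and collected text, keyed by tag name.
        our (%capturing, %text);

        # Called for each opening tag: start capturing at <blockquote>.
        sub start {
            my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
            if ($tagname eq 'blockquote') {
                $capturing{blockquote} = 1;
                $text{blockquote}      = "";
            }
        }

        # Called for each closing tag: stop capturing at </blockquote>.
        sub end {
            my ($self, $tagname, $origtext) = @_;
            if ($tagname eq 'blockquote') {
                $capturing{blockquote} = 0;
                # Do whatever you want to do with $text{blockquote}
            }
        }

        # Called for each piece of text between tags.
        sub text {
            my ($self, $origtext, $is_cdata) = @_;
            $text{blockquote} .= $origtext if $capturing{blockquote};
        }
    }

    my $p = MyParser->new;
    $p->parse_file("foo.html");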
    This will capture all the text between BLOCKQUOTE tags. Of course, you can do more complex rules for capturing what you want and storing it where you want it, but the general idea should be the same.
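The "more complex rules" mentioned above could look something like the following. This is only a sketch, not part of the original reply: it assumes the HTML::Parser version 3 handler interface (`api_version => 3`) instead of version 2 subclassing, and the `%want` list, the handler names and the sample HTML are all made up for illustration. Each wanted element, titles included, ends up as its own entry.

```perl
use strict;
use warnings;
use HTML::Parser;

# Assumption: the elements whose text should each get its own box.
my %want = map { $_ => 1 } qw(title h1 h2 blockquote);

my @captured;    # finished [tagname, text] pairs, in closing order
my @open;        # buffers for wanted elements that are currently open

# Opening tag: start a fresh buffer if this is a wanted element.
sub on_start {
    my ($tag) = @_;
    push @open, { tag => $tag, text => '' } if $want{$tag};
}

# Closing tag: if it closes the innermost open buffer, store that buffer.
sub on_end {
    my ($tag) = @_;
    return unless @open && $open[-1]{tag} eq $tag;
    my $done = pop @open;
    push @captured, [ $done->{tag}, $done->{text} ];
}

# Text: append to every open buffer, so nested wanted elements work too.
sub on_text {
    my ($chunk) = @_;
    $_->{text} .= $chunk for @open;
}

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [ \&on_start, 'tagname' ],
    end_h       => [ \&on_end,   'tagname' ],
    text_h      => [ \&on_text,  'dtext' ],    # dtext = entity-decoded text
);

# Hypothetical sample input; use $p->parse_file(...) for a real file.
$p->parse('<html><head><title>Doc</title></head><body><h1>Head</h1>'
        . '<p>skip me</p><blockquote>Quoted <b>text</b> here</blockquote>'
        . '</body></html>');
$p->eof;

print "$_->[0]: $_->[1]\n" for @captured;
```

Inline tags such as `<b>` inside a wanted element do not interrupt the capture, because only the tags listed in `%want` open or close a buffer.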

    --ZZamboni

Re: I need help with some logic.
by Adam (Vicar) on Jul 11, 2000 at 20:23 UTC
    Why can you only read one line at a time? Line breaks carry no meaning in HTML (\n is just whitespace, except inside <pre> blocks). Why not read the whole file first and parse it that way? It wouldn't take much then to find the plain text. Easy ways to read the whole file include:
    @file = <FILEHANDLE>;                          # list of lines
    # or:
    { local $/ = undef; $file = <FILEHANDLE>; }    # whole file in one string
      The reason is that I need to parse out the HTML tags in the file and print them to the browser. Here is how I am doing it: take in a line, see if there is a starting tag (like <body> or <html>) and send it to $start; check if there is an ending tag (like </table>, </html>, etc.) and send that to $end; and finally see if there is plain text and send that to $text. This works just peachy for printing just the HTML tags to the browser, but now I am left with several $text(s) and I don't know how to keep blocks of text together. Shaheeb
        I didn't ask why you need to parse the HTML. I asked why you can't read the whole file. But I see what the issue is now: you are worried about font tags and line breaks breaking up your plain text. My advice is to either forget all the breaks and join ' ', @lines, or use a state machine to keep track of where you are as you parse.
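One way the state-machine suggestion might be sketched, assuming the HTML::Parser version 3 handler interface: the only state is the list of text chunks seen since the last block boundary. The `%boundary` list, the `close_block` helper and the sample HTML are illustrative assumptions, not something from the thread. The chunks are joined with `''` rather than `' '` because each chunk already carries its own whitespace, and a word split by an inline tag should stay whole.

```perl
use strict;
use warnings;
use HTML::Parser;

# Assumption: tags that mark a block boundary. Inline tags such as
# <font>, <b> or <br> are deliberately absent, so they never split a block.
my %boundary = map { $_ => 1 } qw(title h1 h2 h3 p div blockquote li td);

my @boxes;     # one entry per finished block of text
my @chunks;    # text chunks seen since the last boundary

# Close the current block: join its chunks and collapse all whitespace,
# since newlines in the source are not meaningful outside <pre>.
sub close_block {
    my $block = join '', @chunks;
    $block =~ s/\s+/ /g;
    $block =~ s/^ //;
    $block =~ s/ $//;
    push @boxes, $block if length $block;
    @chunks = ();
}

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [ sub { close_block() if $boundary{ $_[0] } }, 'tagname' ],
    end_h       => [ sub { close_block() if $boundary{ $_[0] } }, 'tagname' ],
    text_h      => [ sub { push @chunks, $_[0] }, 'dtext' ],
);

# Hypothetical sample input with a line break and a font tag inside a block.
$p->parse(<<'HTML');
<title>My Page</title>
<h1>Intro</h1>
<p>First line<br>
second line with <font size="2">small</font> text</p>
<p>Another paragraph</p>
HTML
$p->eof;
close_block();    # flush any text left after the last tag

print "[$_]\n" for @boxes;
```

The title and the heading each land in their own box, while the paragraph containing `<br>` and `<font>` stays together as a single block.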