HTML::Parser - getting all contained HTML?

howie has asked for the wisdom of the Perl Monks concerning the following question:

I have a Blogger-based weblog, with special DIV tags around the individual posts, so I can parse the resulting file. The tags are like <div CLASS="post" ID="5736047" UNUSEDATTRIBUTE="9/17/2001 01:31:37 PM"> and can contain any HTML (although usually it's fairly basic formatting).

I have an HTML::Parser based search function for my site that can parse these files, but as a side effect, I lose all HTML tags between the DIVs. At the moment, that's OK because I can then grep more reliably. However, for a different project, I'd like the same thing but with the original HTML between the DIVs - how does this work in HTML::Parser 3? I have to confess I don't really understand how the parser is structured...

Here's the current guts of the search, basically adapted from one of the examples:

my $p = HTML::Parser->new(api_version => 3, start_h => [\&div_start_handler, "self,tagname,attr"]);

$p->parse_file($file);
do_stuff();

sub div_start_handler
{
    my($self, $tag, $attr) = @_;
    my($blogdate);  

    return unless ($tag eq "div");
    return unless exists $attr->{class};
    return unless $attr->{class} eq 'post';

    #global, so the endhandler knows what the last ID was
    $blogid = $attr->{id};
    $BlogArticles{$blogid}{DateText} = $attr->{unusedattribute};

    $self->handler(text  => [], "dtext" );
    $self->handler(end   => \&div_end_handler, "self,tagname");
}

sub div_end_handler
{   
    my($self, $tag) = @_;

    return unless $tag eq "div";

    my $text = join("", map $_->[0], @{$self->handler("text")});
    $BlogArticles{$blogid}{BodyText} = $text;

    $self->handler("text", undef);
    $self->handler("start", \&div_start_handler);
    $self->handler("end", undef);
}

(crazyinsomniac) Re: HTML::Parser - getting all contained HTML? by crazyinsomniac (Prior) on Sep 18, 2001 at 09:56 UTC
"but as a side effect, I lose all HTML tags between the DIVs" ???

HTML::Parser "tokenizes" all input (for the most part) and calls the appropriate handlers. What happens to the "html" is that it gets parsed, turned into tokens, and passed as arguments to the handlers. To preserve the HTML, you have to recreate it out of the tokens and store it someplace.

Looking at your code snippet and what you're trying to do, it looks like you would be better off using HTML::TokeParser. It is an alternative interface to HTML::Parser where you don't set up "handlers" that process the data automatically; instead you "pull" tokens out of the data one at a time, and you can push tokens back with unget_token when you need to look ahead. There is a tutorial, incidentally by me, in the Tutorials section, aptly named HTML::TokeParser Tutorial.

I suggest you also take a look at the XML::Parser Tutorial, as it seems you'd be better off using proper XML to store your "data" (btw, the HTML::Parser and XML::Parser interfaces are very similar, with only a few "name" changes ;D).
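As a rough illustration of the pull-style approach suggested in this reply, here is a minimal sketch that rebuilds the original HTML inside each post DIV from HTML::TokeParser tokens. The function name extract_posts and the BodyHTML key are made up for this example (DateText mirrors the key in the question's code), and the sketch assumes post DIVs are well formed; nested DIVs are handled by counting depth.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: returns { id => { DateText => ..., BodyHTML => ... } }
# for every <div class="post"> found in the given HTML string.
sub extract_posts {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "Cannot parse input";
    my %posts;

    # get_tag('div') skips ahead to the next <div> start tag.
    while (my $tag = $p->get_tag('div')) {
        my $attr = $tag->[1];    # attribute names arrive lowercased
        next unless ($attr->{class} // '') eq 'post';

        my ($depth, $body) = (1, '');
        while (my $t = $p->get_token) {
            my $type = $t->[0];
            if ($type eq 'S' && $t->[1] eq 'div') {
                $depth++;                      # nested <div>
            }
            elsif ($type eq 'E' && $t->[1] eq 'div') {
                last if --$depth == 0;         # matching </div> of the post
            }
            # Text tokens carry the raw source in slot 1; every other
            # token type carries it in the last slot.
            $body .= $type eq 'T' ? $t->[1] : $t->[-1];
        }

        $posts{ $attr->{id} } = {
            DateText => $attr->{unusedattribute},
            BodyHTML => $body,
        };
    }
    return \%posts;
}
```

Because the raw source text of each token is concatenated unchanged, entities such as &amp;amp; and the original attribute casing of inner tags are kept as written in the file.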
Re: (crazyinsomniac) Re: HTML::Parser - getting all contained HTML? by Anonymous Monk on Sep 20, 2001 at 03:32 UTC
Having had a good hard look at the docs, I figured it out using HTML::Parser directly: (as you said) adding all the handlers and reconstructing the appropriate parts of the document. I know I'd be better off with XML, but this was a one-off conversion from the HTML docs into something more 'edible'. Next time I have something like this, I'll try TokeParser.
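The poster's final code is not shown in the thread. What follows is only a guess at what "adding all the handlers and reconstructing" might look like with HTML::Parser 3, not the actual solution. It asks for the "text" argspec, which passes the original source text of each event, and appends that text while inside a post DIV. The name collect_posts and the depth counter are assumptions made for this sketch.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns { id => "original inner HTML" } for every
# <div class="post"> in the given HTML string.
sub collect_posts {
    my ($html) = @_;
    my (%posts, $id);
    my $depth = 0;    # 0 means we are outside any post div

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr, $text) = @_;
            if ($depth) {
                $depth++ if $tag eq 'div';    # nested <div>
                $posts{$id} .= $text;
            }
            elsif ($tag eq 'div' && ($attr->{class} // '') eq 'post') {
                ($id, $depth) = ($attr->{id}, 1);
                $posts{$id} = '';
            }
        }, 'tagname,attr,text' ],
        end_h => [ sub {
            my ($tag, $text) = @_;
            return unless $depth;
            return if $tag eq 'div' && --$depth == 0;    # post closed
            $posts{$id} .= $text;
        }, 'tagname,text' ],
        # Text, comments, declarations and processing instructions all
        # fall through to the default handler.
        default_h => [ sub {
            $posts{$id} .= $_[0] if $depth && defined $_[0];
        }, 'text' ],
    );

    $p->parse($html);
    $p->eof;
    return \%posts;
}
```

Keeping the state in closure variables avoids swapping handlers in and out mid-parse, which is the part of the original snippet that is easiest to get wrong.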