Use Parsers To Get Chunk of HTML?

Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

Slight corollary question to my previous scraper node -- although you don't need to read it to answer this one -- I need to extract links from a page, but not all of them.

Now sometimes I can find the links I want on an HTML page just by matching a URL pattern. That approach works well with HTML::TokeParser or similar.

But say a site uses a completely opaque URL format like "?storyid=123456" for everything?

What I've done in the past is to find the chunk of the page which contains those "good" links as a way to exclude the "bad" ones. And I've done it the "dumb" way, i.e.

    $whole_thing =~
      m|<some unique html start string>(.*?)<end string>|s;
    $good_chunk = $1;

and then working on the $good_chunk.

I've spent a bit of time looking at HTML::TokeParser and HTML::Parser and I can't seem to figure out how to do the equivalent.

Say I've determined that what I need is

<div id="good_chunk">

up to the closing tag of that DIV.

I need something like

while ( my $token = $p->get_tag( "div" ) ) {
    if ( $token->[1]->{'id'} eq 'good_chunk' ){
        # get the entire contents of the div, as HTML,
        # for further parsing
    }
}

Perhaps I'm missing something obvious?

($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
=~y~b-v~a-z~s; print


Replies are listed 'Best First'.
Re: Use Parsers To Get Chunk of HTML? by merzy (Scribe) on Jul 04, 2005 at 04:22 UTC
I've become a big fan of HTML::TreeBuilder for this sort of thing. If I understand your question, you'd do something like:

    use HTML::TreeBuilder;
    my $tree = HTML::TreeBuilder->new_from_content($whole_thing);
    $tree->elementify();
    my $good_chunk = $tree->look_down("_tag","div","id","good_chunk");
    my $links_ref = $good_chunk->extract_links;
    my $good_chunk_html = $good_chunk->as_HTML;

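A slightly more defensive variant of the snippet above, as a sketch only: `look_down` returns nothing when no element matches, so it seems worth checking the result before calling methods on it. The helper name `links_in_div` is made up for illustration, and only the first item of each `extract_links` entry (the URL) is used.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the href values of all <a> elements
# inside the first <div> with the given id (empty list if none).
sub links_in_div {
    my ( $html, $id ) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $div  = $tree->look_down( _tag => 'div', id => $id );

    my @urls;
    if ($div) {
        # Passing 'a' limits the result to links found in <a> elements.
        for my $link ( @{ $div->extract_links('a') } ) {
            push @urls, $link->[0];    # first item is the URL
        }
    }

    $tree->delete;    # release the tree explicitly
    return @urls;
}
```

Called as `my @good = links_in_div( $whole_thing, 'good_chunk' );`, this should leave links outside the div out of `@good`.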
Re^2: Use Parsers To Get Chunk of HTML? by Cody Pendant (Prior) on Jul 04, 2005 at 04:33 UTC
Wow, that worked straight away! Brilliant stuff. Thank you.
Re: Use Parsers To Get Chunk of HTML? by GrandFather (Saint) on Jul 04, 2005 at 04:17 UTC
Take a look at HTML::TreeBuilder. It builds a tree representing the HTML in memory, which you can then extract information from in various ways.

Perl is Huffman encoded by design.
Re: Use Parsers To Get Chunk of HTML? by polettix (Vicar) on Jul 04, 2005 at 10:45 UTC
    $whole_thing =~
      m|<some unique html start string>(.*?)<end string>|s;
    $good_chunk = $1;

The match could fail here, so you should check that it succeeded before using `$1`; otherwise you'll get whatever value was left over from the previous successful match. You could also evaluate the match in list context:

    ($good_chunk) = $whole_thing =~
      m|<some unique html start string>(.*?)<end string>|s;

although readability may suffer a bit. This assigns `$1` to `$good_chunk` if the regex matches, and `undef` otherwise.

Flavio
perl -ple'$_=reverse' <<<ti.xittelop@oivalf

Don't fool yourself.
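The "check before using `$1`" advice could be wrapped up as follows. This is a minimal sketch using core Perl only; the function name `extract_chunk` and its start/end marker arguments are hypothetical, and `\Q...\E` is used so the markers are matched literally rather than as regex syntax.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between $start and $end,
# or undef (empty list in list context) when the markers are not found.
sub extract_chunk {
    my ( $whole_thing, $start, $end ) = @_;

    # Only touch $1 inside the branch where the match succeeded.
    if ( $whole_thing =~ m{\Q$start\E(.*?)\Q$end\E}s ) {
        return $1;
    }
    return;
}
```

A caller can then test `defined` on the result instead of relying on whatever `$1` happens to hold.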
Re^2: Use Parsers To Get Chunk of HTML? by Cody Pendant (Prior) on Jul 05, 2005 at 23:17 UTC
Thanks for that, good point.
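For completeness, the token-based loop from the original question can probably be finished along these lines. This is only a sketch, assuming HTML::TokeParser: it tracks nested `<div>` depth and glues the original source text of each token back together. The function name `div_inner_html` is hypothetical.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return the inner HTML of the first <div> whose
# id equals $id, or undef if no such div exists.
sub div_inner_html {
    my ( $html, $id ) = @_;
    my $p = HTML::TokeParser->new( \$html ) or return;

    while ( my $tag = $p->get_tag('div') ) {
        my $attr = $tag->[1];
        next unless defined $attr->{id} && $attr->{id} eq $id;

        my $depth = 1;     # currently inside the target div
        my $inner = '';
        while ( my $t = $p->get_token ) {
            my $type = $t->[0];
            if ( $type eq 'S' && $t->[1] eq 'div' ) {
                $depth++;                  # nested div opens
            }
            elsif ( $type eq 'E' && $t->[1] eq 'div' ) {
                last if --$depth == 0;     # closing tag of the target div
            }
            # The original source text sits at a different index
            # depending on the token type.
            $inner .= $type eq 'S'  ? $t->[4]
                    : $type eq 'E'  ? $t->[2]
                    : $type eq 'PI' ? $t->[2]
                    :                 $t->[1];    # T, C and D tokens
        }
        return $inner;
    }
    return;
}
```

The returned string can then be fed to a second parser pass to pull out just the "good" links.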