A real parser (like HTML::Parser) will take care of all the icky little details (like the number and ordering of irrelevant tag attributes, the presence or absence of quotes on attribute values, etc.). Those details have nothing to do with the structural content you're after, but they are hard to get right and may change arbitrarily when web services get updated, making your regex fail.
Here's a version of your script, based loosely on the first offering in the "EXAMPLES" section of the HTML::Parser man page -- I think it does what you want:
    #!/usr/bin/perl
    use strict;
    use WWW::Mechanize;
    use HTML::Parser;

    # Called for every start tag; does nothing until it sees
    # <center class="categories">.
    sub start_handler {
        my ( $tag, $attr, $self ) = @_;
        return unless ( $tag eq 'center' and $$attr{class} eq 'categories' );
        # From here on, print all text ...
        $self->handler( text => sub { print shift }, 'text' );
        # ... and stop parsing at the closing </center>.
        $self->handler( end => sub { shift->eof if shift eq 'center'; },
                        'tagname,self' );
    }

    my $mech = WWW::Mechanize->new();
    $mech->get("http://search.cpan.org/");
    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ \&start_handler, "tagname,attr,self" ],
    );
    $parser->parse( $mech->content() );

Now, I realize that getting your head around that and understanding how it works might be a challenge if you're not acquainted with parsing modules in general. But if you RTFM and take the time to figure it out, it will be worth your while the next time you need to do this sort of task.
Update: On looking at the last $self->handler call, I realized that it would be easier to comprehend if the sub were written so that its statement syntax was consistent with the order of args given to it -- that is:
    $self->handler( end => sub { if ( shift eq 'center' ) { shift->eof } },
                    "tagname,self" );
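To see why the parser shrugs off the "icky little details" mentioned at the top, here is a minimal sketch that needs no network access. The sample markup, the @variants array and the grab_text helper are made up for illustration and are not part of the original script; the sketch also assumes a Perl new enough (5.10+) to have the // operator. It feeds the same element, written with different quoting, letter case and attribute order, through HTML::Parser, and each variant should come out as the same text:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample markup: one element written three different ways
# (double quotes, no quotes plus an extra attribute, upper case plus single quotes).
my @variants = (
    q{<center class="categories">Alpha</center>},
    q{<center id=x class=categories>Alpha</center>},
    q{<CENTER CLASS='categories' align="left">Alpha</CENTER>},
);

# Hypothetical helper: return the text found inside <center class="categories">.
sub grab_text {
    my ($html) = @_;
    my $out    = '';
    my $inside = 0;    # true while we are between the start and end tags

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'tagname' and the keys of 'attr' arrive lower-cased by default.
        start_h => [
            sub {
                my ( $tag, $attr ) = @_;
                $inside = 1
                    if $tag eq 'center'
                    and ( $attr->{class} // '' ) eq 'categories';
            },
            'tagname,attr'
        ],
        # 'dtext' is the text with HTML entities decoded.
        text_h => [ sub { $out .= $_[0] if $inside }, 'dtext' ],
        end_h  => [ sub { $inside = 0 if $_[0] eq 'center' }, 'tagname' ],
    );

    $parser->parse($html);
    $parser->eof;      # flush anything still buffered
    return $out;
}

print grab_text($_), "\n" for @variants;
```

A regex written against the first variant would need separate handling for the other two; the handlers above never look at quoting or attribute order at all.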
In reply to Re: Grab sections of text
by graff
in thread Grab sections of text
by uni_j