The problem you ran into -- and the kind of solution offered in the first reply -- is a good reason for learning how to do this sort of thing with a real parsing approach, rather than with regex matching.

A real parser (like HTML::Parser) will take care of all the icky little details (like quantity / ordering of irrelevant tag attributes, presence / absence of quotes on attribute values, etc), which have nothing to do with the structural content you're after, but are hard to get right and may change arbitrarily when web services get updated, making your regex fail.
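
To make that concrete, here is a minimal sketch. The two HTML fragments are made up for illustration (they are not taken from search.cpan.org), and the regex is just a typical hand-written pattern, not the one from your script. Both fragments describe the same element, but they differ in attribute order and quoting:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical fragments: same element, different attribute order and quoting.
my @fragments = (
    '<center class="categories" id="x">A</center>',
    '<center id=x class=categories>B</center>',
);

my $regex_hits  = 0;
my $parser_hits = 0;

for my $html (@fragments) {
    # A typical hand-written pattern: it only matches one exact spelling.
    $regex_hits++ if $html =~ /<center class="categories"/;

    # The parser hands us the attributes as a hash, whatever their
    # order or quoting in the source.
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ( $tag, $attr ) = @_;
                $parser_hits++
                    if $tag eq 'center'
                    and ( $attr->{class} || '' ) eq 'categories';
            },
            'tagname,attr',
        ],
    );
    $p->parse($html);
    $p->eof;
}

print "regex matched $regex_hits, parser matched $parser_hits\n";
```

The regex only catches the first fragment, while the start handler sees both, because it compares the parsed attribute value rather than the raw text of the tag.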

Here's a version of your script, based loosely on the first offering in the "EXAMPLES" section of the HTML::Parser man page -- I think it does what you want:

#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize;
use HTML::Parser;

# Called for every start tag; does nothing until it sees
# <center class="categories">.
sub start_handler {
    my ( $tag, $attr, $self ) = @_;
    return unless ( $tag eq 'center' and ( $attr->{class} || '' ) eq 'categories' );

    # Inside the target element: print all text, and stop parsing
    # at the first closing </center>.
    $self->handler( text => sub { print shift }, 'text' );
    $self->handler( end => sub { shift->eof if shift eq 'center'; },
                    'tagname,self' );
}

my $mech = WWW::Mechanize->new();

$mech->get("http://search.cpan.org/");

my $parser = HTML::Parser->new(api_version => 3,
                               start_h => [ \&start_handler, "tagname,attr,self" ]);

$parser->parse( $mech->content());

Now, I realize that getting your head around that and understanding how it works might be a challenge if you're not acquainted with parsing modules in general. But if you RTFM and take the time to figure it out, it will be worth your while the next time you need to do this sort of task.

Update: On looking at the last $self->handler call, I realized that it would be easier to comprehend if the sub were written so that its statements use the args in the same order as they are given to it -- that is:

$self->handler( end => sub { if ( shift eq 'center' ) { shift->eof } }, "tagname,self" );

In reply to Re: Grab sections of text by graff
in thread Grab sections of text by uni_j
