Grab sections of text

uni_j has asked for the wisdom of the Perl Monks concerning the following question:

Hey guys, I'm trying to pull a segment of text from a variable. In this case I am testing on CPAN and would like to pull only the HTML that is between my $s and $e tags. My start value isn't returning a match, and I'm trying to figure out 1) what I am doing wrong here and 2) what other way I can use to grab segments of code from an HTML page (so it is easier to parse individual nodes). Thanks, monks!

use WWW::Mechanize;

my $mech = WWW::Mechanize->new();

$mech->get("http://search.cpan.org/");
$html = $mech->content();

$s = '<center class="categories">';
$e = '</center>';

$html =~ s/\r//g;
$html =~ s/\n/ /g;

my $start = index($html,$s);

print $start;

Replies are listed 'Best First'.
Re: Grab sections of text by graff (Chancellor) on Jan 14, 2010 at 00:36 UTC
The problem you ran into, and the kind of solution offered in the first reply, is a good reason to learn how to do this sort of thing with a real parsing approach rather than with regex matching. A real parser (like HTML::Parser) takes care of all the icky little details (quantity and ordering of irrelevant tag attributes, presence or absence of quotes on attribute values, etc.), which have nothing to do with the structural content you're after, but are hard to get right and may change arbitrarily when web services get updated, making your regex fail.

Here's a version of your script, based loosely on the first offering in the "EXAMPLES" section of the HTML::Parser man page. I think it does what you want:

#!/usr/bin/perl
use strict;
use WWW::Mechanize;
use HTML::Parser;

sub start_handler {
    my ( $tag, $attr, $self ) = @_;
    # Ignore every start tag except <center class="categories">.
    return unless ( $tag eq 'center' and $$attr{class} eq 'categories' );
    # From here on, print text and stop parsing at the closing </center>.
    $self->handler( text => sub { print shift }, 'text' );
    $self->handler( end => sub { shift->eof if shift eq 'center'; }, 'tagname,self' );
}

my $mech = WWW::Mechanize->new();
$mech->get("http://search.cpan.org/");
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ \&start_handler, "tagname,attr,self" ]
);
$parser->parse( $mech->content() );

Now, I realize that getting your head around that and understanding how it works might be a challenge if you're not acquainted with parsing modules in general. But if you RTFM and take the time to figure it out, it will be worth your while the next time you need to do this sort of task.

Update: On looking at the last `$self->handler` call, I realized that it would be easier to comprehend if the sub were written so that its statement syntax was consistent with the order of the args given to it, that is:

$self->handler( end => sub { if ( shift eq 'center' ) { shift->eof } }, "tagname,self" );
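To see why the parser approach survives the quoting difference that tripped up the original script, here is a minimal sketch that runs similar handlers over two inline strings, one with the attribute quoted and one without. The sample markup and the helper name `center_text` are made up for illustration; it assumes only that HTML::Parser is installed, and it collects the text into a variable instead of printing it.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (not part of HTML::Parser): collect the text found
# inside the first-level <center class="categories"> element.
sub center_text {
    my ($html) = @_;
    my $inside = 0;     # true while we are between the start and end tag
    my $text   = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ( $tag, $attr ) = @_;
                # The parser hands us the attribute value with any quotes removed.
                $inside = 1
                    if $tag eq 'center'
                    and ( $attr->{class} // '' ) eq 'categories';
            },
            'tagname,attr'
        ],
        text_h => [ sub { $text .= $_[0] if $inside }, 'dtext' ],
        end_h  => [ sub { $inside = 0 if $_[0] eq 'center' }, 'tagname' ],
    );

    $parser->parse($html);
    $parser->eof;       # signal end of input so buffered text is flushed
    return $text;
}

# Made-up sample markup standing in for the fetched page.
my $quoted   = center_text('<p>a</p><center class="categories">Perl <b>Modules</b></center><p>b</p>');
my $unquoted = center_text('<p>a</p><center class=categories>Perl <b>Modules</b></center><p>b</p>');

print "same result\n" if $quoted eq $unquoted;
```

Both calls go through the same handlers, so a site switching between the two forms of the tag would not require any change to the script.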
Re: Grab sections of text by Fox (Pilgrim) on Jan 13, 2010 at 18:02 UTC
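I guess the problem is in the $s variable. Looking at the page source I see `<center class=categories>`; notice there are no quotes around "categories". This works for me:

use LWP::Simple;

my $html = get "http://search.cpan.org/";
my $s = '<center class=categories>';
my $e = '</center>';

# substr takes a length, not an end offset, so compute it from both positions,
# and look for the end tag only after the start tag.
my $start = index( $html, $s );
my $end   = index( $html, $e, $start );
my $txt   = substr( $html, $start, $end - $start + length($e) );
print "$txt\n";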
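The index/substr idea can be wrapped in a small helper that also copes with a missing marker, which `index` signals by returning -1 (the symptom in the original question). A minimal sketch, assuming a hypothetical helper named `extract_between` and a made-up inline string in place of the live page:

```perl
use strict;
use warnings;

# Hypothetical helper: return the text from $s through $e inclusive,
# or undef when either marker is not found.
sub extract_between {
    my ( $html, $s, $e ) = @_;

    my $start = index( $html, $s );
    return if $start < 0;    # index returns -1 when there is no match

    # Search for the end marker only after the start marker.
    my $end = index( $html, $e, $start + length($s) );
    return if $end < 0;

    # substr takes a length, not an end offset.
    return substr( $html, $start, $end - $start + length($e) );
}

# Made-up sample markup; note the stray </center> before the section we want.
my $sample = '<p>top</p></center><center class=categories><a href="/x">X</a></center><p>end</p>';

my $found = extract_between( $sample, '<center class=categories>', '</center>' );
print defined $found ? "$found\n" : "no match\n";

# The quoted form of the tag does not occur in the sample, so nothing is found.
my $missed = extract_between( $sample, '<center class="categories">', '</center>' );
print defined $missed ? "$missed\n" : "no match\n";
```

Checking for -1 before calling `substr` makes the quoted-versus-unquoted mismatch show up as an explicit "no match" rather than as a puzzling slice of the page.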
Re: Grab sections of text by wfsp (Abbot) on Jan 14, 2010 at 12:19 UTC
I agree with graff's approach. Perhaps a gentler introduction to parsing HTML would be to avoid HTML::Parser's "handlers" (which I find can be tricky) and consider something like HTML::TokeParser::Simple. To my eye it often "reads" better, ymmv.

#!/usr/bin/perl
use warnings;
use strict;

use LWP::Simple;
use HTML::TokeParser::Simple;

my $html = get(qq{http://search.cpan.org/});

my $p = HTML::TokeParser::Simple->new(\$html);

my $start_found;
while (my $t = $p->get_token){
    $start_found++, last if (
        $t->is_start_tag(q{center}) and
        $t->get_attr(q{class}) and
        $t->get_attr(q{class}) eq q{categories}
    );
}
die qq{tag not found} unless $start_found;

my $center_html;
while (my $t = $p->get_token){
    last if ( $t->is_end_tag(q{center}) );
    $center_html .= $t->as_is;
}
die qq{no content found} unless $center_html;
print $center_html;