Sailor99 has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to scrape successive web pages by using
 $mech->find_link( text => "Next>>" ); 
(This appears as "Next>>" in a web browser, but that doesn't work.) The "Next>>" doesn't work either. Does anyone have a clue as to how to handle this? I can't just use the word "Next" (in the regex form) as it's not specific enough. Thanks in advance for any insights.

Replies are listed 'Best First'.
Re: Mech follow_link
by merlyn (Sage) on Feb 12, 2007 at 03:15 UTC
    The source says:
    return if defined $p->{text} && !(defined($link->text) && $link->text eq $p->{text} );
    so it's doing a literal "eq". What does $link->text return for that link? Is it possible that there's some whitespace on either side?
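    To illustrate why a literal "eq" is sensitive to stray whitespace, here is a minimal sketch in plain Perl. The link text " Next>> " is a made-up value for illustration, not something taken from the actual page; printing the real $link->text in brackets the same way should reveal whether whitespace is the culprit.

```perl
use strict;
use warnings;

# Hypothetical link text as an HTML parser might return it; the
# surrounding whitespace is an assumption for illustration.
my $link_text = " Next>> ";
my $wanted    = "Next>>";

# Bracket the value when printing so stray whitespace is visible.
print "link text: [$link_text]\n";

# Exact comparison, as in the quoted source line.
my $exact = ( $link_text eq $wanted ) ? 1 : 0;

# Same comparison after trimming leading/trailing whitespace.
( my $trimmed = $link_text ) =~ s/^\s+|\s+$//g;
my $after_trim = ( $trimmed eq $wanted ) ? 1 : 0;

print "exact match: $exact, after trim: $after_trim\n";
```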
Re: Mech follow_link
by Tanktalus (Canon) on Feb 12, 2007 at 03:15 UTC

    Perhaps WWW::Mechanize is unescaping it, and it really is text => 'Next>>'?

    (There's not enough details for someone, like myself, with little experience with WWW::Mechanize, to come in and test such a theory, at least not in some trivially easy manner - e.g., code, URL.)

Re: Mech follow_link
by friedo (Prior) on Feb 12, 2007 at 03:20 UTC
    You can also use $mech->find_link( text_regex => qr/pattern/ ) to find links via regex.
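    As a rough sketch of how a regex could stay specific to the full "Next>>" label while tolerating whitespace around it, the pattern below anchors both ends and quotes the literal with \Q...\E. The sample link texts are invented for illustration; only the text_regex parameter comes from the suggestion above.

```perl
use strict;
use warnings;

# Build a pattern that matches the whole link text "Next>>" while
# tolerating surrounding whitespace. \Q...\E quotes any regex
# metacharacters in the literal.
my $label   = "Next>>";
my $pattern = qr/^\s*\Q$label\E\s*$/;

# Hypothetical link texts, for illustration only.
my @texts   = ( "Next>>", "  Next>>\n", "Next", "Next page>>" );
my @matched = grep { $_ =~ $pattern } @texts;

print scalar(@matched), " of ", scalar(@texts), " texts matched\n";

# With a WWW::Mechanize object in $mech, the same pattern would be passed as:
#   my $link = $mech->find_link( text_regex => $pattern );
```

    Anchoring with ^ and $ keeps a bare "Next" from matching, which addresses the concern that "Next" alone is not specific enough.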
Re: Mech follow_link
by jonsmith1982 (Beadle) on Feb 12, 2007 at 06:31 UTC
    Here's some code that I did ...
    #!/usr/bin/perl -w
    use strict;
    use WWW::Mechanize;

    my $q = $ARGV[0] || die("\nusage : ".$0." query 3 spaces\n\n");
    $q .= ' '.$ARGV[1] if ($ARGV[1]);
    $q .= ' '.$ARGV[2] if ($ARGV[2]);
    my $url = 'http://www.google.co.uk';
    my $mech = WWW::Mechanize->new(agent => "WWW");
    $mech->get($url);
    $mech->submit_form(form_name => 'f',fields => {'q' => $q}, button => 'btnG');
    for my $i (1..200) {
        print "page : ".$i."\n";
        dosomat($mech->content);
        my $next = $mech->find_link( text_regex => qr/Next$/, url_regex => qr/^\/search\?/)
            or die("finished on page : ".$i."\n");
        $mech->get($url.$next->url);
    }
    exit(0);

    sub dosomat {
        my ($content) = @_;
        my @cont = $content =~ m{<div class=g(.*?)</div>}gsi;
        for my $i (0..$#cont) {
            print $cont[$i]."\n\n\n";
        }
    }


    I had the same problem, and found that the link had div tags within it.

    Google won't let me use their services since I ran this script a couple of times. :(
Re: Mech follow_link
by ady (Deacon) on Feb 12, 2007 at 06:17 UTC
    Cut/paste from one of my programs, maybe that alternative will work for you
    while (!$done && $ie->follow_link(n=>2)) {   # Click <-- ('previous') link;
        ...
    }
    allan