Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am wondering what the issue could be with WWW::Mechanize's find_link in this case. The URL I am trying is the one below:

 my $url = "http://careers.republic.co.uk/pb3/corporate/Republic/search.php";

After clicking the first page button, it leads to a page that lists some jobs. There is also a Next Page link. I am trying to move to the next page automatically using Mechanize's find_link with a URL regex (/page=2/) or a text regex (/Next/), but it gives me some other links.

Please, can any monk help me out with this? My code is below:

use strict;
use WWW::Mechanize;

my $search_url = "http://careers.republic.co.uk/pb3/corporate/Republic/search.php";
my $mech = WWW::Mechanize->new();
eval {
    $mech->agent_alias('Mac Safari');
    $mech->get($search_url);
    $mech->click('p_bRun');
};
my $next = $mech->find_link( text_regex => qr#Next# );
# my $next = $mech->find_link( url_regex => qr#page=2# ); # Either way not succeeding
print $next->url() . "\n";
#exit;
my $filename = 'path.html';
$mech->save_content( $filename );
exit;

Please help me out with this, monks. Thanks to all.

Re: Wondering what could be the issue with mechanize find_link.
by choroba (Cardinal) on Jan 15, 2013 at 11:41 UTC
    The page you are trying to scrape uses poor HTML markup: many <a> tags are not closed. WWW::Mechanize gets confused and puts the whole table into one link:
    print $_->text, "\n" for $mech->find_all_links;
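To see the effect without hitting the site, here is a minimal offline sketch. It assumes HTML::TokeParser is installed (WWW::Mechanize depends on it for link extraction, as far as I know) and mimics the usual "find an <a>, then read text up to the next </a>" approach. The markup and the helper name link_texts are made up for illustration; they are not taken from the actual page.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up markup: the first <a> is never closed, the second one is.
my $html = '<td><a href="/job/1">Clerk</td>'
         . '<td><a href="/job/2">Manager</a></td>';

# Hypothetical helper: collect [href, text] pairs the way a simple
# TokeParser-based link extractor would.
sub link_texts {
    my ($markup) = @_;
    my $parser = HTML::TokeParser->new(\$markup);
    my @links;
    while (my $tag = $parser->get_tag('a')) {
        my $href = $tag->[1]{href};
        # Reads everything up to the next </a>, which swallows any
        # <a> start tags met on the way.
        my $text = $parser->get_trimmed_text('/a');
        push @links, [$href, $text];
    }
    return @links;
}

my @links = link_texts($html);
printf "%d link(s) found\n", scalar @links;
print "$_->[0] => '$_->[1]'\n" for @links;
```

With the unclosed tag, the second link should be absorbed into the text of the first, so only one link is reported. That matches what find_link sees on the real page.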
    However, you can count the pages yourself and build the URL from pieces instead of getting it from the page:
    #!/usr/bin/perl
    use warnings;
    use strict;
    use WWW::Mechanize;

    my $search_url = 'http://careers.republic.co.uk/pb3/corporate/Republic/search.php';
    my $mech = WWW::Mechanize->new();
    $mech->agent_alias('Mac Safari');
    $mech->get($search_url);

    my $page_number = 1;
    PAGE: while (1) {
        $mech->get("$search_url?page=$page_number");
        print " *** $page_number *** \n";
        print $_->text, "\n" for $mech->find_all_links;
        last PAGE if $mech->content !~ /Next/;
        $page_number++;
    }
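One possible refinement, offered only as a sketch: put the "build the URL and stop when there is no Next" loop behind a page cap, so that a page which always contains the word "Next" cannot loop forever. The helper name crawl_pages, the fetcher callback, the cap, and the fake pages below are all assumptions for illustration. With Mechanize, the fetcher could be something like sub { $mech->get($_[0]); $mech->content }.

```perl
use strict;
use warnings;

# Hypothetical helper: visit "$base_url?page=N" for N = 1, 2, ... and stop
# when a page no longer mentions "Next" or when $max_pages is reached.
# $fetch is a code ref that takes a URL and returns the page body.
sub crawl_pages {
    my ($fetch, $base_url, $max_pages) = @_;
    my @visited;
    for my $page (1 .. $max_pages) {
        my $url  = "$base_url?page=$page";
        my $body = $fetch->($url);
        last unless defined $body;    # nothing came back, give up
        push @visited, $url;
        last if $body !~ /Next/;      # no further page advertised
    }
    return @visited;
}

# Fake three-page site so the sketch runs offline (made-up content).
my $base = 'http://example.invalid/search.php';
my %fake = (
    "$base?page=1" => 'jobs A ... Next',
    "$base?page=2" => 'jobs B ... Next',
    "$base?page=3" => 'jobs C ... (last page)',
);

my @pages = crawl_pages(sub { $fake{ $_[0] } }, $base, 10);
print "$_\n" for @pages;
```

The cap is a judgment call; pick a value comfortably above the number of pages you expect.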
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ