ShayShay has asked for the wisdom of the Perl Monks concerning the following question:

<sigh> I've searched and tried all day and just can't figure this out. </sigh>

What I need is to go to a site, search for something, grab every href for each hit, then go to the next page of results and do the same until there are no more pages.

So, let's say I enter "Harry Potter" as my search term at www.mywebsite.com... The page that is returned to me has a form which includes what page I'm looking at as well as how many pages are left. There are 20 results per page. I need to grab each result and then go on to the next page by posting the form.

Does that make sense?

Here's what I've got. I can't even get it to go to page 2.

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use WWW::Mechanize;
use HTML::Form;

my $url = 'http://www.ncbi.nlm.nih.gov/sites/entrez?term=rnr2&cmd=Search&db=nuccore&QueryKey=17';
my $browser = WWW::Mechanize->new;
my $site = $browser->get($url);
die( "Can't get $url -- ", $site->status_line ) unless $site->is_success;

$browser->form('EntrezForm');

foreach my $item($browser->form('EntrezForm')){
    my $nextPage = "";
    my $maxPage = "";
    my $field="";
    my $fieldValue = "";
    print "\n"."-----NewPage-----"."\n";
    while( my ($k, $v) = each %$item ) {
        if ($k eq "action"){
            my $action = $v;
            print "\n\n"."ACTION: ".$action."\n";
        }
        if ($k eq "method"){
            my $method = $v;
            print "\n\n"."METHOD: ".$method."\n";
        }
        if ($k eq "attr") {
            print "\n\n"."ATTRIBUTES"."\n";
            while( my ($k, $v) = each %$v ) {
                print "key: $k, value: $v.\n";
            }
        }
        if ($k eq "inputs"){
            print "\n\n"."INPUTS"."\n";
            my @newarray = @$v;
            foreach my $thisItem(@newarray){
                while (my($key, $value) = each %$thisItem){
                    if ( (($key eq "name") && ($value eq "EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Pager.PageNumber"))||
                         (($key eq "name") && ($value eq "EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Pager.MaxPage")) ) {
                        $field = $value;
                        if ($field =~ m/PageNumber/){
                            $nextPage=($fieldValue+1);
                            $browser->set_fields("$field" => "$nextPage",);
                        }
                        if ($field =~ m/MaxPage/){
                            $maxPage=$fieldValue;
                        }
                        print $field." => ".$fieldValue."\n";
                    }
                    if ($key eq "value"){
                        $fieldValue = $value;
                    }
                }
            }
        }
    }
    #parse HTML to get <a>links</a> of each organism hit
    #save links to file for use after this big loop
    if ($nextPage <= $maxPage) {
        $browser->submit();
        print "submit";
        $browser->content;
        $browser->form('EntrezForm');
    }
}
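
The "parse HTML to get links" step is left as a comment in the code above. One possible way to fill it in, as a rough sketch only, is HTML::LinkExtor from the HTML-Parser distribution. The sample markup, the example.com base URL, and the helper name hrefs_in are all made up for illustration; the real results page will look different and will probably need some filtering to separate hits from navigation links.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical fragment of a results page; the real markup will differ.
my $html = <<'HTML';
<div class="hit"><a href="/hit?id=111">Hit one</a></div>
<div class="hit"><a href="/hit?id=222">Hit two</a></div>
<img src="/logo.gif">
HTML

# Collect the href of every <a> tag, made absolute against $base.
# (hrefs_in is an illustrative helper, not part of any module.)
sub hrefs_in {
    my ($html, $base) = @_;
    my $extor = HTML::LinkExtor->new(undef, $base);   # no callback: links accumulate
    $extor->parse($html);
    $extor->eof;
    my @hrefs;
    for my $link ($extor->links) {
        my ($tag, %attr) = @$link;                    # e.g. ('a', href => URI object)
        push @hrefs, "$attr{href}" if $tag eq 'a' && defined $attr{href};
    }
    return @hrefs;
}

my @hrefs = hrefs_in($html, 'http://www.example.com/search');
print "$_\n" for @hrefs;
```

With WWW::Mechanize the page content would come from $browser->content after each successful request, and the collected links could be appended to a file inside the paging loop.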

Replies are listed 'Best First'.
Re: post, return, parse, repeat
by merlyn (Sage) on Feb 13, 2008 at 22:06 UTC
      Oh man, has it been that long? I subscribed to that magazine all the way back then just for that column! -A

      --
      By a scallop's forelocks!

      Wow. I'm about to finish my 10th year as a Perl programmer - thanks for making me feel young again!

      -sam

      Your column isn't exactly what I'm looking for. I am using Mechanize. I can't use GET at all because I get a 414 error. I need to use POST.
Re: post, return, parse, repeat
by igelkott (Priest) on Feb 14, 2008 at 00:53 UTC
    NCBI has a number of APIs into their databases. A fairly simple model I used a few years ago is the Entrez Programming Utilities. "Esearch" and "Efetch" can be used to grab 100 records at a time in any of a number of formats. Restrictions and guidelines are posted on that page.
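
A minimal sketch of the paging arithmetic for the E-utilities route, assuming the documented esearch.fcgi endpoint and its db, term, retstart and retmax parameters. The helper names (uri_escape_simple, esearch_url, page_offsets) and the hit count of 250 are hypothetical, the base URL is the current HTTPS one rather than whatever was in use in 2008, and nothing here contacts NCBI; it only builds the URLs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Base URL of the E-utilities service (assumed; check NCBI's own documentation).
my $EUTILS = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils';

# Percent-encode a query value with core Perl only (ASCII input assumed).
sub uri_escape_simple {
    my ($s) = @_;
    $s =~ s/([^A-Za-z0-9_.~-])/sprintf('%%%02X', ord $1)/ge;
    return $s;
}

# Build an ESearch URL for one page of results.
sub esearch_url {
    my (%arg) = @_;
    my @pairs = (
        db       => $arg{db},
        term     => $arg{term},
        retstart => $arg{retstart} // 0,    # index of the first record wanted
        retmax   => $arg{retmax}   // 20,   # number of records per request
    );
    my @query;
    while ( my ($k, $v) = splice @pairs, 0, 2 ) {
        push @query, "$k=" . uri_escape_simple($v);
    }
    return "$EUTILS/esearch.fcgi?" . join '&', @query;
}

# Start offsets needed to walk $count hits, $per_page at a time.
sub page_offsets {
    my ($count, $per_page) = @_;
    my @offsets;
    for ( my $start = 0; $start < $count; $start += $per_page ) {
        push @offsets, $start;
    }
    return @offsets;
}

# Hypothetical total of 250 hits, fetched 100 at a time.
for my $start ( page_offsets(250, 100) ) {
    print esearch_url( db => 'nuccore', term => 'rnr2',
                       retstart => $start, retmax => 100 ), "\n";
}
```

Each URL can then be fetched with LWP::Simple's get() or $browser->get(). The XML that comes back carries the total in a Count element and the record IDs in Id elements, and those IDs can be passed on to efetch.fcgi. Since these are plain GET requests with short query strings, the 414 error mentioned earlier should not come up.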
      Thank you! I'll try looking there! Still... I'd like to know what I'm doing wrong.
        NCBI's result pages rely on JavaScript. To confirm, try switching it off in your browser and you'll see that clicking the next button just redirects to the main page.

        Plain WWW::Mechanize can't handle this, but there have been some discussions of how to get around this issue for IE and FF.

Re: post, return, parse, repeat
by samtregar (Abbot) on Feb 13, 2008 at 22:04 UTC
    You need to give us more information since we can't run your test code (at least I'm not going to - others might be braver about making hits to a .gov site with no idea what it does). Just saying "it's not working" isn't enough.

    But hey, since I'm here, looking at your code made me wonder if you're getting values for $nextPage and $maxPage. It also made me wonder if just submitting the form was enough to go to the next page.

    -sam
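
On whether $nextPage and $maxPage ever get values: the posted loop walks each input's internals with each %$thisItem and reads $fieldValue at the moment it sees the name key, so what it finds appears to depend on hash ordering, which Perl does not guarantee. A sketch of reading the same two fields through HTML::Form's public accessors instead is below. The sample form is a hypothetical stand-in (the real NCBI inputs may be hidden fields or laid out differently), pager_state is a made-up helper, and this only covers reading and setting the values; it does not address whether the site also needs JavaScript to page.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Form;

# Field-name prefix taken from the posted code.
my $pager = 'EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Pager';

# Hypothetical stand-in for the results page; the real markup may differ.
my $html = <<"HTML";
<form name="EntrezForm" method="post" action="/sites/entrez">
  <input type="text" name="${pager}.PageNumber" value="1">
  <input type="text" name="${pager}.MaxPage" value="7">
</form>
HTML

my ($form) = HTML::Form->parse($html, 'http://www.example.com/');

# Return (current page, max page) via the accessors, or an empty
# list if either field is missing. (pager_state is illustrative only.)
sub pager_state {
    my ($form) = @_;
    for my $name ("$pager.PageNumber", "$pager.MaxPage") {
        return unless $form->find_input($name);
    }
    return ( $form->value("$pager.PageNumber"),
             $form->value("$pager.MaxPage") );
}

my ($page, $max) = pager_state($form);
die "pager fields not found\n" unless defined $page;

# Ask for the following page only if there is one.
$form->value("$pager.PageNumber", $page + 1) if $page < $max;

print "page $page of $max, next request asks for page ",
      $form->value("$pager.PageNumber"), "\n";
```

Inside WWW::Mechanize the same idea would go through $browser->form_name('EntrezForm'), then $browser->value($name) to read a field and $browser->field($name, $value) to set it before $browser->submit().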

      It's NCBI. It's a database of organisms and genomes. It's for the public, especially researchers, to use and submit to. It's nothing that is going to get anyone in any trouble. If it was, I certainly wouldn't have included the URL!