Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

This is hard-coded and hard to maintain, and I was wondering if anyone had suggestions on how to make the code more condensed and less repetitious.

And before some of you proclaim we're not allowed to scrape Google, you're wrong. If you sign up they allow X number of crawls per day and we're well within our limit on this one.

my $response = $ua->get("http://www.google.com/search?num=50&hl=en&lr=&safe=off&rls=GGLD%2CGGLD%3A2005-12%2CGGLD%3Aen&q=$search");
if ($response->is_success) { &parser(0); } else { print "$response->status_line"; }

my $response1 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=100&sa=N");
if ($response1->is_success) { &parser(1); } else { print "$response1->status_line"; }

my $response2 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=150&sa=N");
if ($response2->is_success) { &parser(2); } else { print "$response2->status_line"; }

my $response3 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=200&sa=N");
if ($response3->is_success) { &parser(3); } else { print "$response3->status_line"; }

my $response4 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=250&sa=N");
if ($response4->is_success) { &parser(4); } else { print "$response4->status_line"; }

my $response5 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=300&sa=N");
if ($response5->is_success) { &parser(5); } else { print "$response5->status_line"; }

my $response6 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=350&sa=N");
if ($response6->is_success) { &parser(6); } else { print "$response6->status_line"; }

my $response7 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=400&sa=N");
if ($response7->is_success) { &parser(7); } else { print "$response7->status_line"; }

my $response8 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=450&sa=N");
if ($response8->is_success) { &parser(8); } else { print "$response8->status_line"; }

my $response9 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=500&sa=N");
if ($response9->is_success) { &parser(9); } else { print "$response9->status_line"; }

my $response10 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=550&sa=N");
if ($response10->is_success) { &parser(10); } else { print "$response10->status_line"; }

my $response11 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=600&sa=N");
if ($response11->is_success) { &parser(11); } else { print "$response11->status_line"; }

my $response12 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=650&sa=N");
if ($response12->is_success) { &parser(12); } else { print "$response12->status_line"; }

my $response13 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=700&sa=N");
if ($response13->is_success) { &parser(13); } else { print "$response13->status_line"; }

my $response14 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=750&sa=N");
if ($response14->is_success) { &parser(14); } else { print "$response14->status_line"; }

my $response15 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=800&sa=N");
if ($response15->is_success) { &parser(15); } else { print "$response15->status_line"; }

my $response16 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=950&sa=N");
if ($response16->is_success) { &parser(16); } else { print "$response16->status_line"; }

Basically, we have a separate $response for each offset, as Google returns up to 950 results per search query. We want to cut this down, but I have no idea how.

In &parser the number is shifted off the argument list, and we have this:

sub parser {
    my $count = shift;
    my $google_results;
    if    ($count eq "0")  { $google_results = $response->content; }
    elsif ($count eq "1")  { $google_results = $response1->content; }
    elsif ($count eq "2")  { $google_results = $response2->content; }
    elsif ($count eq "3")  { $google_results = $response3->content; }
    elsif ($count eq "4")  { $google_results = $response4->content; }
    elsif ($count eq "5")  { $google_results = $response5->content; }
    elsif ($count eq "6")  { $google_results = $response6->content; }
    elsif ($count eq "7")  { $google_results = $response7->content; }
    elsif ($count eq "8")  { $google_results = $response8->content; }
    elsif ($count eq "9")  { $google_results = $response9->content; }
    elsif ($count eq "10") { $google_results = $response10->content; }
    elsif ($count eq "11") { $google_results = $response11->content; }
    elsif ($count eq "12") { $google_results = $response12->content; }
    elsif ($count eq "13") { $google_results = $response13->content; }
    elsif ($count eq "14") { $google_results = $response14->content; }
    elsif ($count eq "15") { $google_results = $response15->content; }
    elsif ($count eq "16") { $google_results = $response16->content; }

As you can see it DOES work but it's so hard to maintain. Any suggestions on how to make this code nicer?

Janitored by holli - added readmore-tag

Replies are listed 'Best First'.
Re: Making script more efficient
by dragonchild (Archbishop) on May 26, 2005 at 19:09 UTC
    my @urls = ( .... ); # <-- Put the URLs here
    foreach my $url (@urls) {
        my $response = $ua->get( $url );
        unless ($response->is_success) {
            print $response->status_line, $/;
            next;
        }
        my $google_results = $response->content;
        # Do whatever else is in parser() here
    }

    • In general, if you think something isn't in Perl, try it out, because it usually is. :-)
    • "What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?"
      Thank you! That is so much cleaner and nicer than our original code.

      However, when I run the code now, it errors out with "400 error URL must be absolute". Did I make a mistake with the URLs?

      print "Enter your search query: ";
      my $search = <STDIN>;
      chomp($search);
      my $google_results;

      my @urls = qq("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=50&sa=N",
      "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=100&sa=N",
      "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=150&sa=N",
      "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=200&sa=N",
      "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=250&sa=N",
      "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=300&sa=N",
      "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=350&sa=N",
      "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=400&sa=N",
      "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=450&sa=N",
      "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=500&sa=N",
      "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=550&sa=N",
      "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=600&sa=N",
      "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=700&sa=N",
      "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=750&sa=N",
      "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=800&sa=N",
      "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=850&sa=N",
      "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=900&sa=N",
      "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=950&sa=N",
      );

      foreach my $url (@urls) {
          my $response = $ua->get( $url );
          unless ($response->is_success) {
              print $response->status_line, $/;
              next;
          }
          $google_results = $response->content;
          &parser;
      }

      sub parser {
          my @links_wanted;
          my @links_found;
          my $parser = HTML::TokeParser->new( \$google_results );
          while ( my $token = $parser->get_tag( 'a' ) ) {
              my $url = $token->[ 1 ]{ href };
              next unless $url =~ m{^https?://};
              push @links_found, $url;
          }

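        Drop the qq from the @urls definition. qq(...) is just another way of writing a double-quoted string, so @urls ends up holding one long string that begins with a literal quote character, which is why LWP complains that the URL is not absolute.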

        You can also apply the same techniques to your list of URLs. Factor out the commonalities and program for the differences.
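
To see why the qq matters, here is a minimal sketch. The two-URL list and the 'perl' query are placeholders for illustration, not the poster's real data, and the code only counts elements, so it runs without LWP.

```perl
use strict;
use warnings;

my $search = 'perl';    # placeholder query, assumed for illustration

# qq(...) is a double-quoted string, so the whole body (quotes, commas,
# newlines and all) becomes ONE string, and the array gets one element.
my @broken = qq("http://www.google.com/search?q=$search&start=50",
"http://www.google.com/search?q=$search&start=100",
);

# A plain parenthesised list gives one element per URL.
my @fixed = (
    "http://www.google.com/search?q=$search&start=50",
    "http://www.google.com/search?q=$search&start=100",
);

printf "broken: %d element(s), first character: %s\n",
    scalar @broken, substr( $broken[0], 0, 1 );
printf "fixed:  %d element(s), first character: %s\n",
    scalar @fixed, substr( $fixed[0], 0, 1 );
```

The single element of @broken starts with a double quote rather than with http, so it has no scheme for the user agent to work with.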


        If it were me, I would generate the URLs with a loop too, since there is an awful lot of duplicated code.

        my @urls;
        my $search = 'whatever';
        for my $index (0..19) {
            push @urls, 'http://www.google.com/search?q='.$search.'&num=50&hl=en&lr=&safe=off&start='.($index * 50).'&sa=N';
        }
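
Taking the loop one step further, here is a sketch of a helper that also percent-encodes the query, since the search term comes from user input. The name build_urls and its parameters are hypothetical, and the hand-rolled substitution assumes plain ASCII input; in real code, uri_escape from URI::Escape would be the more usual choice.

```perl
use strict;
use warnings;

# Hypothetical helper: build one search URL per page of results.
# $pages and $per_page are illustrative parameters, not from the thread.
sub build_urls {
    my ( $search, $pages, $per_page ) = @_;

    # Percent-encode everything outside the unreserved set (ASCII assumed).
    ( my $query = $search ) =~ s/([^A-Za-z0-9_.~-])/sprintf '%%%02X', ord $1/ge;

    return map {
        my $start = $_ * $per_page;    # offset of the first result on this page
        "http://www.google.com/search?q=$query&num=$per_page&hl=en&lr=&safe=off&start=$start&sa=N";
    } 0 .. $pages - 1;
}

my @urls = build_urls( 'perl monks', 20, 50 );
print "$_\n" for @urls[ 0, 1, -1 ];
```

With 20 pages of 50 results, the offsets run from 0 to 950, matching the range produced by the loop above.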
Re: Making script more efficient
by Fletch (Bishop) on May 26, 2005 at 19:22 UTC

    Method calls don't interpolate inside ""s. Not to mention if you sign up for a developer token you're supposed to go through their SOAP API, not scrape the pages.
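
A minimal sketch of the interpolation point, assuming a tiny stand-in class (FakeResponse is hypothetical and only mimics the status_line method of a real response object, so the example runs without LWP):

```perl
use strict;
use warnings;

# Hypothetical stand-in for an HTTP response object.
package FakeResponse;
sub new         { my ( $class, $line ) = @_; return bless { line => $line }, $class }
sub status_line { return $_[0]{line} }

package main;

my $response = FakeResponse->new('500 Internal Server Error');

# Only the variable interpolates; "->status_line" stays literal text,
# so this yields something like "FakeResponse=HASH(0x...)->status_line".
my $wrong = "$response->status_line";

# Either call the method outside the quotes ...
my $right = $response->status_line . "\n";

# ... or interpolate the call's result through an anonymous array dereference.
my $also_right = "@{[ $response->status_line ]}\n";

print $wrong, "\n", $right, $also_right;
```

Passing the method call to print as its own list item, as in `print $response->status_line, "\n";`, is usually the most readable of these options.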

    --
    We're looking for people in ATL

Re: Making script more efficient
by mrborisguy (Hermit) on May 26, 2005 at 19:12 UTC

    Simple:

    my @addresses = ("...","..."); # list of addresses
    foreach my $addr ( @addresses ) {
        my $response = $ua->get( $addr );
        if ( $response->is_success ) {
            my $google_results = $response->content;
            # maybe parse these results here?
            parse( $google_results );
        }
        else {
            # call the method outside the quotes so the status line is printed
            print $response->status_line, "\n";
        }
    }

    Or something like that. Hopefully it will get you started.

        -Bryan