Making script more efficient

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

This is hard-coded and hard to maintain, and I was wondering if anyone had suggestions on how to make the code a bit more condensed and less repetitious.

And before some of you proclaim we're not allowed to scrape Google, you're wrong. If you sign up they allow X number of crawls per day and we're well within our limit on this one.

my $response = $ua->get("http://www.google.com/search?num=50&hl=en&lr=&safe=off&rls=GGLD%2CGGLD%3A2005-12%2CGGLD%3Aen&q=$search");

 if ($response->is_success) {
     &parser(0);
 }
 else {
     print "$response->status_line";
 }

my $response1 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=100&sa=N");

 if ($response1->is_success) {
     &parser(1);
 }
 else {
     print "$response1->status_line";
 }

my $response2 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=150&sa=N");

 if ($response2->is_success) {
     &parser(2);
 }
 else {
     print "$response2->status_line";
 }

my $response3 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=200&sa=N");

 if ($response3->is_success) {
     &parser(3);
 }
 else {
     print "$response3->status_line";
 }

my $response4 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=250&sa=N");

 if ($response4->is_success) {
     &parser(4);
 }
 else {
     print "$response4->status_line";
 }

my $response5 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=300&sa=N");

 if ($response5->is_success) {
     &parser(5);
 }
 else {
     print "$response5->status_line";
 }

my $response6 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=350&sa=N");

 if ($response6->is_success) {
     &parser(6);
 }
 else {
     print "$response6->status_line";
 }

my $response7 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=400&sa=N");

 if ($response7->is_success) {
     &parser(7);
 }
 else {
     print "$response7->status_line";
 }

my $response8 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=450&sa=N");

 if ($response8->is_success) {
     &parser(8);
 }
 else {
     print "$response8->status_line";
 }

my $response9 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=500&sa=N");

 if ($response9->is_success) {
     &parser(9);
 }
 else {
     print "$response9->status_line";
 }

my $response10 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=550&sa=N");

 if ($response10->is_success) {
     &parser(10);
 }
 else {
     print "$response10->status_line";
 }

my $response11 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=600&sa=N");

 if ($response11->is_success) {
     &parser(11);
 }
 else {
     print "$response11->status_line";
 }

my $response12 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=650&sa=N");

 if ($response12->is_success) {
     &parser(12);
 }
 else {
     print "$response12->status_line";
 }

my $response13 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=700&sa=N");

 if ($response13->is_success) {
     &parser(13);
 }
 else {
     print "$response13->status_line";
 }

my $response14 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=750&sa=N");

 if ($response14->is_success) {
     &parser(14);
 }
 else {
     print "$response14->status_line";
 }


my $response15 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=800&sa=N");

 if ($response15->is_success) {
     &parser(15);
 }
 else {
     print "$response15->status_line";
 }

my $response16 = $ua->get("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=950&sa=N");

 if ($response16->is_success) {
     &parser(16);
 }
 else {
     print "$response16->status_line";
 }

Basically we have a separate $response for each offset, as Google has up to 950 results per search query. We want to cut this down but I have no idea how.

In &parser the number is shifted off the argument list and we have this:

sub parser
{
   my $count = shift;

   my $google_results;

   if ($count eq "0")      {$google_results = $response->content;}
   elsif ($count eq "1")   {$google_results = $response1->content;}
   elsif ($count eq "2")   {$google_results = $response2->content;}
   elsif ($count eq "3")   {$google_results = $response3->content;}
   elsif ($count eq "4")   {$google_results = $response4->content;}
   elsif ($count eq "5")   {$google_results = $response5->content;}
   elsif ($count eq "6")   {$google_results = $response6->content;}
   elsif ($count eq "7")   {$google_results = $response7->content;}
   elsif ($count eq "8")   {$google_results = $response8->content;}
   elsif ($count eq "9")   {$google_results = $response9->content;}
   elsif ($count eq "10")  {$google_results = $response10->content;}
   elsif ($count eq "11")  {$google_results = $response11->content;}
   elsif ($count eq "12")  {$google_results = $response12->content;}
   elsif ($count eq "13")  {$google_results = $response13->content;}
   elsif ($count eq "14")  {$google_results = $response14->content;}
   elsif ($count eq "15")  {$google_results = $response15->content;}
   elsif ($count eq "16")  {$google_results = $response16->content;}

As you can see it DOES work but it's so hard to maintain. Any suggestions on how to make this code nicer?

Janitored by holli - added readmore-tag


Re: Making script more efficient by dragonchild (Archbishop) on May 26, 2005 at 19:09 UTC
    my @urls = ( .... ); # <-- Put the URLs here

    foreach my $url (@urls) {
        my $response = $ua->get( $url );

        unless ($response->is_success) {
            print $response->status_line, $/;
            next;
        }

        my $google_results = $response->content;

        # Do whatever else is in parser() here
    }

In general, if you think something isn't in Perl, try it out, because it usually is. :-)
"What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?"
Re^2: Making script more efficient by Anonymous Monk on May 26, 2005 at 19:31 UTC
Thank you! That is so much cleaner and nicer than our original code. However, when I run the code now it errors out with "400 error URL must be absolute". Did I make a mistake with the URLs?

    print "Enter your search query: ";
    my $search = <STDIN>;
    chomp($search);

    my $google_results;

    my @urls = qq("http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=50&sa=N",
    "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=100&sa=N",
    "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=150&sa=N",
    "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=200&sa=N",
    "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=250&sa=N",
    "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=300&sa=N",
    "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=350&sa=N",
    "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=400&sa=N",
    "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=450&sa=N",
    "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=500&sa=N",
    "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=550&sa=N",
    "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=600&sa=N",
    "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=700&sa=N",
    "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=750&sa=N",
    "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=800&sa=N",
    "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=850&sa=N",
    "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=900&sa=N",
    "http://www.google.com/search?q=$search&num=50&hl=en&lr=&safe=off&start=950&sa=N",
    );

    foreach my $url (@urls) {
        my $response = $ua->get( $url );

        unless ($response->is_success) {
            print $response->status_line, $/;
            next;
        }

        $google_results = $response->content;
        &parser;
    }

    sub parser {
        my @links_wanted;
        my @links_found;

        my $parser = HTML::TokeParser->new( \$google_results );

        while ( my $token = $parser->get_tag( 'a' ) ) {
            my $url = $token->[ 1 ]{ href };
            next unless $url =~ m{^https?://};
            push @links_found, $url;
        }
Re^3: Making script more efficient by dragonchild (Archbishop) on May 26, 2005 at 19:35 UTC
Drop the qq from the @urls definition. You can also apply the same techniques to your list of URLs. Factor out the commonalities and program for the differences.
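To see why the qq matters, here is a minimal sketch using shortened URLs and a made-up search term (both are assumptions for illustration). qq(...) is a double-quoted string, so the whole "list" becomes a single scalar that begins with a literal quote character, which would plausibly explain the "URL must be absolute" complaint:

```perl
use strict;
use warnings;

my $search = 'perl';    # placeholder search term, not from the thread

# qq(...) builds ONE string; the inner quotes and commas are just characters.
my @wrong = qq("http://www.google.com/search?q=$search&start=50",
"http://www.google.com/search?q=$search&start=100",
);

# A plain parenthesised list gives one element per URL.
my @right = (
    "http://www.google.com/search?q=$search&start=50",
    "http://www.google.com/search?q=$search&start=100",
);

printf "with qq:    %d element(s), first character is %s\n",
    scalar @wrong, substr( $wrong[0], 0, 1 );
printf "without qq: %d element(s)\n", scalar @right;
```

With the qq in place the loop runs once, over a string that is not a URL at all.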
Re^3: Making script more efficient by thundergnat (Deacon) on May 26, 2005 at 21:09 UTC
If it were me, I would generate the URLs with a loop too, since there is an awful lot of duplicated code.

    my @urls;
    my $search = 'whatever';
    for my $index (0..19) {
        push @urls, 'http://www.google.com/search?q='.$search.'&num=50&hl=en&lr=&safe=off&start='.($index * 50).'&sa=N';
    }
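The same idea can be wrapped in a small helper. This is only a sketch: build_urls and escape_query are hypothetical names, the page size and page count are passed in as parameters, and the hand-rolled percent-encoding assumes plain ASCII/byte input (the thread's code interpolates the search term unescaped).

```perl
use strict;
use warnings;

# Hypothetical helper: percent-encode a byte string for a query parameter.
# Assumes ASCII/byte input; wide characters would need encoding first.
sub escape_query {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9_.~-])/sprintf('%%%02X', ord $1)/ge;
    return $text;
}

# Hypothetical helper: one search URL per result page.
sub build_urls {
    my ( $search, $page_size, $pages ) = @_;
    my $query = escape_query($search);
    return map {
        sprintf 'http://www.google.com/search?q=%s&num=%d&hl=en&lr=&safe=off&start=%d&sa=N',
            $query, $page_size, $_ * $page_size;
    } 0 .. $pages - 1;
}

my @urls = build_urls( 'perl monks', 50, 20 );
print scalar(@urls), " URLs\n";
print "$urls[0]\n$urls[-1]\n";
```

Changing the number of pages or the page size then means changing one argument rather than editing a list of URLs.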
Re: Making script more efficient by Fletch (Bishop) on May 26, 2005 at 19:22 UTC
Method calls don't interpolate inside double-quoted strings, so print "$response->status_line" won't print the status line. Not to mention that if you sign up for a developer token, you're supposed to go through their SOAP API, not scrape the pages.
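A small sketch of the interpolation point. Demo::Response is a stand-in class invented for this example so that it runs without LWP; only its status_line method mirrors the name used in the thread.

```perl
use strict;
use warnings;

# Stand-in class for illustration only; this is not HTTP::Response.
package Demo::Response;

sub new {
    my ( $class, %args ) = @_;
    my $self = {%args};
    return bless $self, $class;
}

sub status_line {
    my ($self) = @_;
    return $self->{status_line};
}

package main;

my $response = Demo::Response->new( status_line => '404 Not Found' );

# Only $response interpolates (as a stringified reference);
# "->status_line" is left as literal text.
my $broken = "$response->status_line";

# Option 1: call the method outside the string.
my $concat = $response->status_line . "\n";

# Option 2: interpolate an arbitrary expression with @{[ ... ]}.
my $inline = "Request failed: @{[ $response->status_line ]}\n";

print "broken: $broken\n";
print "concat: $concat";
print "inline: $inline";
```

The first line printed shows something like a class name and memory address followed by "->status_line", which is what the original error branches would have printed.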
Re: Making script more efficient by mrborisguy (Hermit) on May 26, 2005 at 19:12 UTC
Simple:

    my @addresses = ("...","..."); # list of addresses

    foreach my $addr ( @addresses ) {
        my $response = $ua->get( $addr );
        if ( $response->is_success ) {
            my $google_results = $response->content;
            # maybe parse these results here?
            parse( $google_results );
        }
        else {
            print $response->status_line, "\n";
        }
    }

Or something like that. Hopefully it will get you started.

-Bryan