in reply to Speeding up HTML parsing
Another thing that might help is abstracting the code from each engine-specific section into a subroutine. This way you can debug that code, compartmentalize it, and have a more readable flow of execution in your main program. Also, if you ever want to reuse the code that you wrote to get and parse the website results, moving that code into subroutines is the first step.
Here's some example code for you: it abstracts the AltaVista search into its own subroutine and uses Parallel::ForkManager to run the requests in parallel:
#!/usr/bin/perl
use LWP::Simple;
use Parallel::ForkManager;
use strict;

$| = 1;    # unbuffer output so results appear as each child finishes

my @urls = qw(
    http://sulfericacid.perlmonk.org
    http://sulfericacid.com
);

# one child process per URL
my $number_of_forks = scalar @urls;
my $forkmanager     = Parallel::ForkManager->new($number_of_forks);

foreach my $site (@urls) {
    $forkmanager->start and next;    # parent moves on to the next URL; child continues below

    my $altavista_results = altavista_search($site);
    print "Searched http://www.altavista.com for site $site\n";
    print "results: $altavista_results\n";

    $forkmanager->finish;
}
$forkmanager->wait_all_children;

#######################
# Altavista!
#######################
sub altavista_search {
    my $url         = shift;
    my $engine_link = "http://www.altavista.com/web/results?q=link:$url&kl=XX&search=Search";

    my $content = get($engine_link);
    return unless defined $content;    # request failed

    my @lines = split /\n/, $content;
    my $results;
    foreach my $line (@lines) {
        $results = $1 if $line =~ m/AltaVista found (.*) results/;
    }
    return $results;
}
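One thing to keep in mind with Parallel::ForkManager: each search runs in a separate child process, so the example above can only print its result from inside the child. If you'd rather collect the counts back in the parent (to sort them, total them, or print a single report), reasonably recent versions of the module let a child hand a data structure back through finish(), which the parent picks up in a run_on_finish callback. A minimal sketch of that, reusing altavista_search from above:

my %results;
$forkmanager->run_on_finish( sub {
    # ($pid, exit code, identifier passed to start(), signal, core dump flag, data ref)
    my ( $pid, $exit_code, $site, $exit_signal, $core_dump, $data ) = @_;
    $results{$site} = $$data if defined $data;
} );

foreach my $site (@urls) {
    $forkmanager->start($site) and next;    # use the URL as the child's identifier
    my $count = altavista_search($site);
    $forkmanager->finish( 0, \$count );     # ship the result back to the parent
}
$forkmanager->wait_all_children;

print "$_ -> $results{$_}\n" for sort keys %results;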
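And once each engine lives in its own subroutine, adding another one is just a matter of writing another sub with the same shape and calling it from the same loop. The URL and the regex below are placeholders rather than a working query for any real engine; they're only there to show the pattern:

#######################
# Some other engine (placeholder URL and pattern -- fill in the real ones)
#######################
sub other_engine_search {
    my $url         = shift;
    my $engine_link = "http://search.example.com/results?q=link:$url";    # placeholder query
    my $content     = get($engine_link);
    return unless defined $content;

    my ($results) = $content =~ m/found ([\d,]+) results/;                # placeholder pattern
    return $results;
}

Each engine differs only in the query URL and the pattern that pulls the count out of the page, so the main loop just needs one extra call per engine.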