sulfericacid has asked for the wisdom of the Perl Monks concerning the following question:

I'm attempting to make a link popularity checker in CGI (most of it will be tested and programmed at the command line until it's ready, to remove unneeded hassle), and for something of this magnitude, speed is definitely a concern. www.marketleap.com has a tool similar to what I want to create, and it takes maybe 10 seconds at most to query all the engines and give you the report; my script takes almost 10 seconds for just 1 or 2 engines. Can someone give me tips on how to drastically increase the speed of this, using modules that come with Perl?

Code so far:

#!/usr/bin/perl
use strict;
use LWP::Simple qw(get $ua);

$| = 1;

# LWP::Simple's get() takes only a URL, so set the User-Agent on the
# exported $ua object instead of passing it as an argument.
$ua->agent('Mozilla/4.76 [en] (win-98; U)');

my $url       = "http://sulfericacid.perlmonk.org";
my $altavista = "http://www.altavista.com/web/results?q=link:$url&kl=XX&search=Search";
my $google    = "http://www.google.com/search?hl=en&lr=&ie=ISO-8859-1&q=link%3A$url&btnG=Google+Search";

########################
# Altavista!
########################
my $altavista_content = get($altavista);
my @altavista_lines   = split /\n/, $altavista_content;

my $altavista_results;
foreach my $altavista_line (@altavista_lines) {
    $altavista_results = $1
        if $altavista_line =~ m/AltaVista found (.*) results/;
}

print "searched: $altavista\n";
print "results: $altavista_results\n";

########################
# Google!
########################
my $google_content = get($google);
my @google_lines   = split /\n/, $google_content;

my $hits;
foreach my $google_line (@google_lines) {
    if ($google_line =~ /Results <b>\d+<\/b> - <b>\d+<\/b> of about <b>((\d{1,3}\,?)+)<\/b>/) {
        $hits = $1;
    }
}
# Results <b>1</b> - <b>1</b> of <b>1</b>.

print "searched: $google\n";
print "results: $hits\n";


"Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"

sulfericacid

Replies are listed 'Best First'.
Re: Speeding up HTML parsing
by biosysadmin (Deacon) on Apr 21, 2004 at 02:45 UTC
    Right now you're getting the results from each engine serially; you could do this faster by using Parallel::ForkManager. It's a very handy module; I wrote a tutorial on it, located here, if you're interested. :D

    Another thing that might help is abstracting the code from each engine-specific section into a subroutine. This way you can debug that code, compartmentalize it, and have a more readable flow of execution in your main program. Also, if you ever want to reuse the code that you wrote to get and parse the website results, moving that code into subroutines is the first step.

    Here's some example code for you; it abstracts the Altavista search method into its own subroutine and also uses Parallel::ForkManager to speed things up:

    #!/usr/bin/perl
    use strict;
    use LWP::Simple;
    use Parallel::ForkManager;

    $| = 1;

    my @urls = qw(
        http://sulfericacid.perlmonk.org
        http://sulfericacid.com
    );

    my $number_of_forks = scalar @urls;
    my $forkmanager     = Parallel::ForkManager->new($number_of_forks);

    foreach my $site (@urls) {
        $forkmanager->start and next;
        my $altavista_results = altavista_search($site);
        print "Searched http://www.altavista.com for site $site\n";
        print "results: $altavista_results\n";
        $forkmanager->finish;
    }
    $forkmanager->wait_all_children;

    #######################
    # Altavista!
    #######################
    sub altavista_search {
        my $url = shift;
        my $engine_link =
            "http://www.altavista.com/web/results?q=link:$url&kl=XX&search=Search";

        my $content = get($engine_link);
        my @lines   = split /\n/, $content;

        my $results;
        foreach my $line (@lines) {
            $results = $1 if $line =~ m/AltaVista found (.*) results/;
        }
        return $results;
    }
Re: Speeding up HTML parsing
by TilRMan (Friar) on Apr 21, 2004 at 03:13 UTC

    Some possibly mutually exclusive suggestions.

    • You're not parsing HTML at all. That's good, because that would just slow you down.
    • Try with $| off. I don't expect it'll make a difference either way. Notice that $| = 1 won't get the data out of Google any faster, so maybe it doesn't do quite what you think.
    • When you find the matching line, last out of the loop.
    • Don't split your big string at all. Just run the regexp against it.
    • Use something more sophisticated than LWP::Simple that will hand you back the stream coming in from the server. Parse the incoming stream and drop the connection when you find the magic line (a sketch of this follows the list). Or roll your own HTTP request with IO::Socket::INET.
    • Parallelization is your friend, but you can't be faster than the slowest engine. Get benchmarks and attack the slow ones first.
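    Here's a hedged sketch of that stream-and-abort idea, combining a few of the suggestions above (no line splitting; bail out as soon as the count appears). It uses LWP::UserAgent's :content_cb callback, which receives the body a chunk at a time; dying inside the callback aborts the transfer. The URL and regex are just the ones from the original post:

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $url = 'http://www.altavista.com/web/results?q=link:example.com&kl=XX&search=Search';

    my $ua = LWP::UserAgent->new;
    my $results;

    # The callback sees each chunk as it arrives; die() stops the
    # download once we have what we need.
    $ua->get( $url, ':content_cb' => sub {
        my ( $chunk, $response, $protocol ) = @_;
        if ( $chunk =~ /AltaVista found (.*) results/ ) {
            $results = $1;
            die "found it\n";    # abort the rest of the transfer
        }
    } );

    print "results: $results\n" if defined $results;

    Real code would also keep a small tail buffer across chunks, since the magic line could straddle a chunk boundary.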
Re: Speeding up HTML parsing
by asdfgroup (Beadle) on Apr 21, 2004 at 10:29 UTC
    You can speed up your task by optimizing the parsing, optimizing the URL fetching, and finally by fetching several URLs in parallel.
    If you profile your code, you'll realize most of the time is spent in the blocking get (more precisely, in the slow read system call). So the first (and, I think, the last :) ) thing to do to optimize your script is to make the URL fetching parallel.

    So let's look at several ways to do this and point out their advantages and disadvantages:

    - Using a fork solution. You fork and download each URL in its own process. An excellent example was shown before: Parallel::ForkManager.
    Advantages: a straightforward, lazy solution. It will work fine for you now.
    Disadvantages: it takes additional system resources, and you will have to implement some IPC between the processes (not so trivial a task!) if (when) you want to combine the results together (a sketch of one way follows).
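    Newer versions of Parallel::ForkManager actually handle that IPC for you: a child can pass a reference to finish(), and the parent collects it in a run_on_finish callback. A hedged sketch, in which the engine URLs and the search_engine() sub are placeholders:

    use strict;
    use warnings;
    use Parallel::ForkManager;

    # Placeholder engine URLs.
    my %engines = (
        google    => 'http://www.google.com/search?q=link:example.com',
        altavista => 'http://www.altavista.com/web/results?q=link:example.com',
    );

    my %results;
    my $pm = Parallel::ForkManager->new( scalar keys %engines );

    # The parent receives whatever reference each child passed to
    # finish() -- this needs a Parallel::ForkManager recent enough to
    # support data structure retrieval.
    $pm->run_on_finish( sub {
        my ( $pid, $exit, $ident, $signal, $core, $data ) = @_;
        $results{$ident} = $$data if defined $data;
    } );

    for my $engine ( keys %engines ) {
        $pm->start($engine) and next;                    # parent moves on
        my $count = search_engine( $engines{$engine} );  # child does the work
        $pm->finish( 0, \$count );                       # ship result to parent
    }
    $pm->wait_all_children;

    print "$_: $results{$_}\n" for keys %results;

    # Hypothetical fetch-and-parse sub; see the original post for the
    # actual fetching and regex work.
    sub search_engine {
        my $url = shift;
        return 0;
    }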

    - Threads: the way this task is usually done in C. We make a separate thread for every URL, so each LWP get runs in its own thread, simultaneously with the others (a sketch follows).
    Advantages: easy to implement IPC. A standard solution.
    Disadvantages: iThreads require a lot of system resources and are not very fast :(
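    A hedged sketch of the ithreads approach, assuming a threads-enabled perl; each thread fetches one URL and join() collects the result back in the main thread (the URLs are placeholders):

    use strict;
    use warnings;
    use threads;
    use LWP::Simple qw(get);

    my @urls = (
        'http://www.google.com/search?q=link:example.com',
        'http://www.altavista.com/web/results?q=link:example.com',
    );

    # One thread per URL; each thread returns the fetched page to join().
    my @threads = map { threads->create( sub { get( $_[0] ) }, $_ ) } @urls;

    for my $i ( 0 .. $#threads ) {
        my $content = $threads[$i]->join;
        printf "%s: %s\n", $urls[$i],
            defined $content ? length($content) . " bytes" : "failed";
    }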

    - Using non-blocking sockets: instead of blocking when no data is ready on a socket, a read returns immediately with EAGAIN/EWOULDBLOCK (and a non-blocking connect returns EINPROGRESS). So you can run an event loop over the set of open sockets: can we read from this socket? Read and parse; try the next socket. Check out Parallel User Agent (a minimal sketch of the loop follows).
    Advantages: fast, and saves resources.
    Disadvantages: a bit complicated to program.
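    A minimal sketch of that event loop using IO::Select, which multiplexes with select(2) rather than fully non-blocking reads but has the same "only touch sockets that are ready" structure. The hosts and request paths are illustrative, and real code would need connect timeouts, redirects, and proper HTTP parsing:

    use strict;
    use warnings;
    use IO::Socket::INET;
    use IO::Select;

    # Illustrative raw HTTP/1.0 requests, one per engine.
    my %request = (
        'www.google.com'    => "GET /search?q=link:example.com HTTP/1.0\r\nHost: www.google.com\r\n\r\n",
        'www.altavista.com' => "GET /web/results?q=link:example.com HTTP/1.0\r\nHost: www.altavista.com\r\n\r\n",
    );

    my $sel = IO::Select->new;
    my %page;    # stringified socket => { host => ..., data => ... }

    for my $host ( keys %request ) {
        my $sock = IO::Socket::INET->new( PeerAddr => $host, PeerPort => 80 )
            or next;
        print {$sock} $request{$host};
        $sel->add($sock);
        $page{$sock} = { host => $host, data => '' };
    }

    # The event loop: only read from sockets that select() says are ready.
    while ( $sel->count ) {
        my @ready = $sel->can_read(10) or last;    # 10-second timeout
        for my $sock (@ready) {
            if ( sysread( $sock, my $chunk, 4096 ) ) {
                $page{$sock}{data} .= $chunk;
            }
            else {    # EOF: the server closed the connection
                $sel->remove($sock);
                my $p = delete $page{$sock};
                printf "%s: %d bytes\n", $p->{host}, length $p->{data};
                close $sock;
            }
        }
    }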
Re: Speeding up HTML parsing
by Fletch (Bishop) on Apr 21, 2004 at 03:10 UTC
      What I am doing really isn't against their Terms of Service in the least; I'm just not doing it the right way yet. It won't let me parse them until I set up the API for it, but that's another story. I'm more concerned with how to increase the speed so I can begin adding the other ~8+ engines to the list.


      "Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"

      sulfericacid
        No Automated Querying
        You may not send automated queries of any sort to Google's system without express permission in advance from Google.

        You are sending automated queries to Google; by the letter of their TOS, that's forbidden. You may get away with it, or you may get your IP (or, worse, your ISP's netblock) blocked by Google. I'm simply pointing out that Google technically forbids any non-browser access that doesn't go through their official API. Go right ahead and live on the edge (as long as you're not on my netblock :).

Re: Speeding up HTML parsing
by asdfgroup (Beadle) on Apr 21, 2004 at 10:41 UTC
    And some code for a PUA (LWP::Parallel::UserAgent) realization :)
    Skeleton code here:
    use strict;
    use LWP::Parallel::UserAgent;
    use HTTP::Request;

    my %Engines = (
        google => { url => 'qqq', parse => sub {} },
        av     => { url => 'zzz', parse => sub {} },
    );

    my $pua = LWP::Parallel::UserAgent->new();
    $pua->register( HTTP::Request->new( GET => $_->{url} ), $_->{parse} )
        for values %Engines;
    $pua->wait();

    # at this point you have all pages fetched and filtered through the parse subs
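    For what it's worth, a hedged sketch of what one of those parse subs might look like, assuming the standard LWP chunk-callback signature (content, response object, protocol); the engine URL is a placeholder and the regex is the one from the original post:

    my %Engines = (
        altavista => {
            url   => 'http://www.altavista.com/web/results?q=link:example.com',
            parse => sub {
                my ( $content, $response, $protocol ) = @_;
                # Match against each chunk as it arrives; real code would
                # keep a tail buffer in case the match straddles chunks.
                print "altavista: $1\n"
                    if $content =~ /AltaVista found (.*) results/;
            },
        },
    );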