sulfericacid has asked for the wisdom of the Perl Monks concerning the following question:

I'm attempting to make a link popularity checker in CGI (most of it will be tested and programmed at the command line until it's ready, to remove unneeded hassle), and for something of this magnitude, speed is definitely a concern. www.marketleap.com has a tool similar to what I want to create, and it takes maybe 10 seconds at most to query all the engines and give you the report; my script takes almost 10 seconds for just 1 or 2 engines. Can someone give me tips on how to drastically increase the speed of this, using modules that come with Perl?

Code so far:

#!/usr/bin/perl
use strict;
use LWP::Simple qw(get $ua);

$| = 1;

# LWP::Simple's get() takes only a URL, so set the User-Agent on the
# exported $ua object instead of passing it as an argument.
$ua->agent('Mozilla/4.76 [en] (win-98; U)');

my $url       = "http://sulfericacid.perlmonk.org";
my $altavista = "http://www.altavista.com/web/results?q=link:$url&kl=XX&search=Search";
my $google    = "http://www.google.com/search?hl=en&lr=&ie=ISO-8859-1&q=link%3A$url&btnG=Google+Search";

########################
# Altavista!
########################
my $altavista_content = get($altavista);
my @altavista_lines   = split /\n/, $altavista_content;

my $altavista_results;
foreach my $altavista_line (@altavista_lines) {
    $altavista_results = $1
        if $altavista_line =~ m/AltaVista found (.*) results/;
}

print "searched: $altavista\n";
print "results: $altavista_results\n";

########################
# Google!
########################
my $google_content = get($google);
my @google_lines   = split /\n/, $google_content;

my $hits;
foreach my $google_line (@google_lines) {
    if ($google_line =~ /Results <b>\d+<\/b> - <b>\d+<\/b> of about <b>((\d{1,3}\,?)+)<\/b>/) {
        $hits = $1;
    }
}
# Results <b>1</b> - <b>1</b> of <b>1</b>.

print "searched: $google\n";
print "results: $hits\n";


"Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"

sulfericacid

Replies are listed 'Best First'.
Re: Speeding up HTML parsing
by biosysadmin (Deacon) on Apr 21, 2004 at 02:45 UTC
    Right now you're getting the results from each engine serially; you could do this faster by using Parallel::ForkManager. It's a very handy module; I wrote a tutorial on it, located here, if you're interested. :D

    Another thing that might help is abstracting the code from each engine-specific section into a subroutine. This way you can debug that code, compartmentalize it, and have a more readable flow of execution in your main program. Also, if you ever want to reuse the code that you wrote to get and parse the website results, moving that code into subroutines is the first step.

    Here's some example code for you; it abstracts the Altavista search method into its own subroutine and also uses Parallel::ForkManager to speed things up:

    #!/usr/bin/perl
    use strict;
    use LWP::Simple;
    use Parallel::ForkManager;

    $| = 1;

    my @urls = qw(
        http://sulfericacid.perlmonk.org
        http://sulfericacid.com
    );

    my $number_of_forks = scalar @urls;
    my $forkmanager     = Parallel::ForkManager->new($number_of_forks);

    foreach my $site (@urls) {
        $forkmanager->start and next;
        my $altavista_results = altavista_search($site);
        print "Searched http://www.altavista.com for site $site\n";
        print "results: $altavista_results\n";
        $forkmanager->finish;
    }
    $forkmanager->wait_all_children;

    #######################
    # Altavista!
    #######################
    sub altavista_search {
        my $url = shift;
        my $engine_link =
            "http://www.altavista.com/web/results?q=link:$url&kl=XX&search=Search";

        my $content = get($engine_link);
        my @lines   = split /\n/, $content;

        my $results;
        foreach my $line (@lines) {
            $results = $1 if $line =~ m/AltaVista found (.*) results/;
        }
        return $results;
    }
Re: Speeding up HTML parsing
by TilRMan (Friar) on Apr 21, 2004 at 03:13 UTC

    Some possibly mutually exclusive suggestions.

    • You're not parsing HTML at all. That's good, because that would just slow you down.
    • Try with $| off. I don't expect it'll make a difference either way. Notice that $| = 1 won't get the data out of Google any faster, so maybe it doesn't do quite what you think.
    • When you find the matching line, last out of the loop.
    • Don't split your big string at all. Just run the regexp against it.
    • Use something more sophisticated than LWP::Simple that will hand you back the stream coming in from the server. Parse the incoming stream and drop the connection when you find the magic line (a sketch of this follows the list). Or roll your own HTTP request with IO::Socket::INET.
    • Parallelization is your friend, but you can't be faster than the slowest engine. Get benchmarks and attack the slow ones first.
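    Here's a hedged sketch of that stream-and-abort idea, combining a few of the suggestions above (no line splitting; bail out as soon as the count appears). It uses LWP::UserAgent's :content_cb callback, which receives the body a chunk at a time; dying inside the callback aborts the transfer. The URL and regex are just the ones from the original post:

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $url = 'http://www.altavista.com/web/results?q=link:example.com&kl=XX&search=Search';

    my $ua = LWP::UserAgent->new;
    my $results;

    # The callback sees each chunk as it arrives; die() stops the
    # download once we have what we need.
    $ua->get( $url, ':content_cb' => sub {
        my ( $chunk, $response, $protocol ) = @_;
        if ( $chunk =~ /AltaVista found (.*) results/ ) {
            $results = $1;
            die "found it\n";    # abort the rest of the transfer
        }
    } );

    print "results: $results\n" if defined $results;

    Real code would also keep a small tail buffer across chunks, since the magic line could straddle a chunk boundary.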
Re: Speeding up HTML parsing
by asdfgroup (Beadle) on Apr 21, 2004 at 10:29 UTC
    You can speed up your task by optimizing the parsing, optimizing the URL fetching, and finally by fetching several URLs in parallel.
    If you profile your code, you'll realize most of the time is spent in the blocking get (more precisely, in the slow read system call). So the first (and, I think, the last :) ) thing to do to optimize your script is to make the URL fetching parallel.

    So let's look at several ways to do this and point out their advantages and disadvantages:

    - Using a fork solution. You fork and download each URL in its own process. An excellent example was shown before: Parallel::ForkManager.
    Advantages: a straightforward, lazy solution. It will work fine for you now.
    Disadvantages: it takes additional system resources, and you will have to implement some IPC between the processes (not so trivial a task!) if (when) you want to combine the results together (a sketch of one way follows).
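    Newer versions of Parallel::ForkManager actually handle that IPC for you: a child can pass a reference to finish(), and the parent collects it in a run_on_finish callback. A hedged sketch, in which the engine URLs and the search_engine() sub are placeholders:

    use strict;
    use warnings;
    use Parallel::ForkManager;

    # Placeholder engine URLs.
    my %engines = (
        google    => 'http://www.google.com/search?q=link:example.com',
        altavista => 'http://www.altavista.com/web/results?q=link:example.com',
    );

    my %results;
    my $pm = Parallel::ForkManager->new( scalar keys %engines );

    # The parent receives whatever reference each child passed to
    # finish() -- this needs a Parallel::ForkManager recent enough to
    # support data structure retrieval.
    $pm->run_on_finish( sub {
        my ( $pid, $exit, $ident, $signal, $core, $data ) = @_;
        $results{$ident} = $$data if defined $data;
    } );

    for my $engine ( keys %engines ) {
        $pm->start($engine) and next;                    # parent moves on
        my $count = search_engine( $engines{$engine} );  # child does the work
        $pm->finish( 0, \$count );                       # ship result to parent
    }
    $pm->wait_all_children;

    print "$_: $results{$_}\n" for keys %results;

    # Hypothetical fetch-and-parse sub; see the original post for the
    # actual fetching and regex work.
    sub search_engine {
        my $url = shift;
        return 0;
    }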

    - Threads: the way this task is usually done in C. We make a separate thread for every URL, so each LWP get runs in its own thread, simultaneously with the others (a sketch follows).
    Advantages: easy to implement IPC. A standard solution.
    Disadvantages: iThreads require a lot of system resources and are not very fast :(
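    A hedged sketch of the ithreads approach, assuming a threads-enabled perl; each thread fetches one URL and join() collects the result back in the main thread (the URLs are placeholders):

    use strict;
    use warnings;
    use threads;
    use LWP::Simple qw(get);

    my @urls = (
        'http://www.google.com/search?q=link:example.com',
        'http://www.altavista.com/web/results?q=link:example.com',
    );

    # One thread per URL; each thread returns the fetched page to join().
    my @threads = map { threads->create( sub { get( $_[0] ) }, $_ ) } @urls;

    for my $i ( 0 .. $#threads ) {
        my $content = $threads[$i]->join;
        printf "%s: %s\n", $urls[$i],
            defined $content ? length($content) . " bytes" : "failed";
    }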

    - Using non-blocking sockets: instead of blocking when no data is ready on a socket, a read returns immediately with EAGAIN/EWOULDBLOCK (and a non-blocking connect returns EINPROGRESS). So you can run an event loop over the set of open sockets: can we read from this socket? Read and parse; try the next socket. Check out Parallel User Agent (a minimal sketch of the loop follows).
    Advantages: fast, and saves resources.
    Disadvantages: a bit complicated to program.
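    A minimal sketch of that event loop using IO::Select, which multiplexes with select(2) rather than fully non-blocking reads but has the same "only touch sockets that are ready" structure. The hosts and request paths are illustrative, and real code would need connect timeouts, redirects, and proper HTTP parsing:

    use strict;
    use warnings;
    use IO::Socket::INET;
    use IO::Select;

    # Illustrative raw HTTP/1.0 requests, one per engine.
    my %request = (
        'www.google.com'    => "GET /search?q=link:example.com HTTP/1.0\r\nHost: www.google.com\r\n\r\n",
        'www.altavista.com' => "GET /web/results?q=link:example.com HTTP/1.0\r\nHost: www.altavista.com\r\n\r\n",
    );

    my $sel = IO::Select->new;
    my %page;    # stringified socket => { host => ..., data => ... }

    for my $host ( keys %request ) {
        my $sock = IO::Socket::INET->new( PeerAddr => $host, PeerPort => 80 )
            or next;
        print {$sock} $request{$host};
        $sel->add($sock);
        $page{$sock} = { host => $host, data => '' };
    }

    # The event loop: only read from sockets that select() says are ready.
    while ( $sel->count ) {
        my @ready = $sel->can_read(10) or last;    # 10-second timeout
        for my $sock (@ready) {
            if ( sysread( $sock, my $chunk, 4096 ) ) {
                $page{$sock}{data} .= $chunk;
            }
            else {    # EOF: the server closed the connection
                $sel->remove($sock);
                my $p = delete $page{$sock};
                printf "%s: %d bytes\n", $p->{host}, length $p->{data};
                close $sock;
            }
        }
    }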
Re: Speeding up HTML parsing
by Fletch (Bishop) on Apr 21, 2004 at 03:10 UTC
      What I am doing really isn't against their Terms of Service in the least; I'm just not doing it the right way yet. It won't let me parse them until I set up the API for it, but that's another story. I'm more concerned with how to increase the speed so I can begin adding the other ~8+ engines to the list.


      "Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"

      sulfericacid
        No Automated Querying
        You may not send automated queries of any sort to Google's system without express permission in advance from Google.

        You are sending automated queries to Google; by the letter of their TOS, that's forbidden. You may get away with it, or you may get your IP (or, worse, your ISP's netblock) blocked by Google. I'm simply pointing out that Google technically forbids any non-browser access that doesn't go through their official API. Go right ahead and live on the edge (as long as you're not on my netblock :).

Re: Speeding up HTML parsing
by asdfgroup (Beadle) on Apr 21, 2004 at 10:41 UTC
    And some code for a PUA (LWP::Parallel::UserAgent) realization :)
    Skeleton code here:
    use strict;
    use LWP::Parallel::UserAgent;
    use HTTP::Request;

    my %Engines = (
        google => { url => 'qqq', parse => sub {} },
        av     => { url => 'zzz', parse => sub {} },
    );

    my $pua = LWP::Parallel::UserAgent->new();
    $pua->register( HTTP::Request->new( GET => $_->{url} ), $_->{parse} )
        for values %Engines;
    $pua->wait();

    # at this point you have all pages fetched and filtered through the parse subs
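    For what it's worth, a hedged sketch of what one of those parse subs might look like, assuming the standard LWP chunk-callback signature (content, response object, protocol); the engine URL is a placeholder and the regex is the one from the original post:

    my %Engines = (
        altavista => {
            url   => 'http://www.altavista.com/web/results?q=link:example.com',
            parse => sub {
                my ( $content, $response, $protocol ) = @_;
                # Match against each chunk as it arrives; real code would
                # keep a tail buffer in case the match straddles chunks.
                print "altavista: $1\n"
                    if $content =~ /AltaVista found (.*) results/;
            },
        },
    );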