Re: ...How to parse search engine results fast?
by Fletch (Bishop) on Feb 03, 2005 at 15:53 UTC
You might get better results from Google using their API (not to mention that scraping them is against their Terms of Service . . .).
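A rough sketch of what the API route looks like, assuming the Net::Google module from CPAN (you have to register with Google for a license key first; the key below is just a placeholder):
#!/usr/bin/perl
use strict;
use warnings;
use Net::Google;

# Placeholder key -- substitute the license key Google issues you
use constant GOOGLE_KEY => 'your-license-key-here';

my $google = Net::Google->new(key => GOOGLE_KEY);

# Build a search object, set the query, and pull back results
my $search = $google->search();
$search->query('perl');
$search->max_results(100);

print $_->title(), "\n" for @{ $search->results() };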
Thanks,
but my question has more of an architectural flavor...
Do you have any ideas?
Re: ...How to parse search engine results fast?
by hardburn (Abbot) on Feb 03, 2005 at 15:48 UTC
The other metasearch engines may have bigger hardware and a lot more bandwidth than you do. They might also cache results. Web scraping is a straightforward task, so I doubt they're doing anything inherently faster than you are (except, maybe, using a faster HTML parser).
"There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.
Ciao,
I don't think so. For example, why does dogPile download 200 results from 4 different search engines in 1 second, while my P4 1.7 GHz, 512 MB RAM, 100 Mbit machine (totally idle) takes 1.5 sec to download Google's top 100?
Hardware matters in a high-load environment...
Obviously, dogPile doesn't submit a search to each of the search engines every time you enter something into dogPile. For one, that would be terrible for speed (as your problem is showing). Surely dogPile saves the results it fetches from the engines and reuses them the next time someone else runs the same search.
So for example, you search for 'hello world'. dogPile sees that these search terms haven't been fetched before, so dogPile queries all the search engines. Next time someone searches for 'hello world', dogPile doesn't need to refetch the search results since it cached them on its own servers.
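A minimal sketch of that caching layer, assuming Cache::FileCache (from the Cache::Cache distribution); fetch_from_engines() is a hypothetical stand-in for whatever code actually queries the engines:
#!/usr/bin/perl
use strict;
use warnings;
use Cache::FileCache;   # from the Cache::Cache distribution

my $cache = Cache::FileCache->new({
    namespace          => 'metasearch',
    default_expires_in => 3600,   # reuse cached results for an hour
});

sub cached_search {
    my $terms   = shift;
    my $results = $cache->get($terms);
    unless (defined $results) {
        # cache miss: do the slow work once, then store it
        $results = fetch_from_engines($terms);
        $cache->set($terms, $results);
    }
    return $results;
}

sub fetch_from_engines {
    my $terms = shift;
    # hypothetical placeholder -- imagine the parallel LWP fetches here
    return "results for $terms";
}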
Re: ...How to parse search engine results fast?
by inman (Curate) on Feb 03, 2005 at 18:04 UTC
The following example gets information from three sources: Google, MSN and Yahoo!. You would need to create a custom parser for each engine; you may wish to look at HTML::Parser for this (a rough parsing sketch follows the code below).
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use threads;
use Thread::Queue;

my $query       = "perl";
my $dataQueue   = Thread::Queue->new;
my $threadCount = 0;

# Kick off one detached worker thread per engine listed after __DATA__
while (<DATA>) {
    chomp;
    s/^\s+//;
    s/\s+$//;
    my ($engine, $url) = split /\s+/;
    next unless $url;
    $url .= $query;
    print "$url\n";
    my $thr = threads->new(\&doSearch, $engine, $url);
    $thr->detach;
    $threadCount++;
}

# Collect one ($engine, $content) pair from the queue per thread
while ($threadCount) {
    my $engine  = $dataQueue->dequeue;
    my $content = $dataQueue->dequeue;
    print "$engine returned: $content\n";
    $threadCount--;
}
print "Parse and return remaining content\n";

# Fetch a single URL and push the result onto the shared queue
sub doSearch {
    my ($engine, $url) = @_;
    my $ua = LWP::UserAgent->new;
    $ua->agent('Mozilla/5.0');
    $ua->timeout(10);
    $ua->env_proxy;
    my $response = $ua->get($url);
    if ($response->is_success) {
        $dataQueue->enqueue($engine, $response->content);
    }
    else {
        $dataQueue->enqueue($engine, $response->message);
    }
}

__DATA__
Google http://www.google.com/search?q=
Yahoo! http://search.yahoo.com/search?p=
MSN http://beta.search.msn.co.uk/results.aspx?q=
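As for the parsing side, here is a rough sketch using HTML::TokeParser (which sits on top of HTML::Parser). It just pulls every link and its anchor text out of a page; a real parser would need engine-specific filtering on top of this:
use HTML::TokeParser;

# Extract [href, anchor text] pairs from a page of HTML.
# Usage: my @links = extract_links($response->content);
sub extract_links {
    my $html = shift;
    my $p    = HTML::TokeParser->new(\$html);
    my @links;
    while (my $tag = $p->get_tag('a')) {
        my $href = $tag->[1]{href} or next;
        my $text = $p->get_trimmed_text('/a');
        push @links, [ $href, $text ];
    }
    return @links;
}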
Do you think that managing 3-4 requests to different search engines with LWP::Parallel would give me any speed benefit?
V.B.
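For reference, a minimal LWP::Parallel sketch looks roughly like this (the URLs are illustrative):
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Parallel::UserAgent;
use HTTP::Request;

my @urls = (
    'http://www.google.com/search?q=perl',
    'http://search.yahoo.com/search?p=perl',
);

my $pua = LWP::Parallel::UserAgent->new;
$pua->timeout(10);

# register() queues a request; it only returns a response object
# if registration itself failed
foreach my $url (@urls) {
    if (my $err = $pua->register(HTTP::Request->new(GET => $url))) {
        warn "could not register $url\n";
    }
}

# wait() blocks until all registered requests complete or time out
my $entries = $pua->wait;
for my $entry (values %$entries) {
    my $res = $entry->response;
    print $res->request->url, " => ", $res->code, "\n";
}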
If I comment out the print line in
while ($threadCount) {
    my $engine  = $dataQueue->dequeue;
    my $content = $dataQueue->dequeue;
    #print "$engine returned: $content\n";
    $threadCount--;
}
I frequently get the error (warning?) "A thread exited while two threads were running".
I am a thread newbie and don't know why this is happening, how "bad" it is, or whether it's bad at all.
You may want to check back at What is the fastest way to download a bunch of web pages?, where BrowserUK does something similar which doesn't give this warning. At least, not yet.
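For comparison, a minimal sketch that joins its threads instead of detaching them; since the main thread then waits for every worker to finish, the exit-time warning should not appear (URLs illustrative):
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use LWP::UserAgent;

my @urls = (
    'http://www.google.com/search?q=perl',
    'http://search.yahoo.com/search?p=perl',
);

# One worker thread per URL; each returns the page body or an error
my @threads = map {
    threads->create(sub {
        my $url = shift;
        my $res = LWP::UserAgent->new(timeout => 10)->get($url);
        return $res->is_success ? $res->content : $res->message;
    }, $_);
} @urls;

# join() blocks until the thread returns and hands back its result
for my $thr (@threads) {
    my $content = $thr->join;
    print "got ", length($content), " bytes\n";
}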
At any rate, thanks for giving me something to get my fingers dirty with in thread world.
inman, can you send me your private email address? Thanks.