A200560 has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm building a metasearch engine based on data mining techniques... but that's not important...

My question is about the performance of scraping search engine results from an HTML response page.

I see that some metasearch engines (Mamma, DogPile, Vivisimo, etc.) present the top 50 results from 3-5 search engines in about 1 second.

With my Perl script I can retrieve the top 100 Google results in about 1.5 seconds, but from only one search engine!

Can somebody (much more skilled in Perl than I am) suggest an advanced technique (parallelism, threads, ...?) to retrieve results from 3-5 search engines very quickly?


Excuse my English (I'm Italian) and my poor Perl skills.

Thanks,

VB

Replies are listed 'Best First'.
Re: ...How to parse search engine results fast?
by Fletch (Bishop) on Feb 03, 2005 at 15:53 UTC

    You might get better results from Google using their API (not to mention that scraping them is against their Terms of Service . . .).

      thanks,


      but my question has an architectural flavor...



      Do you have some idea?

        I have an idea that the best architecture in the world will process 0 results a second if the site won't return any, because the querying IP or network has been blocked for not following the site's rules (and doing that to a major search engine is a sure way to end up on your hosting provider's or ISP's bad side).

Re: ...How to parse search engine results fast?
by hardburn (Abbot) on Feb 03, 2005 at 15:48 UTC

    The other metasearch engines may have bigger hardware and a lot more bandwidth than you do. They might also cache results. Web scraping is a straightforward task, so I doubt they're doing anything inherently faster than you are (except, maybe, using a faster HTML parser).
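    As an illustration of what a dedicated parser looks like, here is a minimal HTML::Parser sketch that just pulls every link out of a page. Matching only anchor tags is an assumption for the example; a real result scraper would match the specific markup each engine wraps its hits in.

    ```perl
    use strict;
    use warnings;
    use HTML::Parser;

    # Collect every href from <a> tags. HTML::Parser fires the start_h
    # callback once per opening tag, passing the tag name and attributes.
    sub extract_links {
        my ($html) = @_;
        my @links;
        my $p = HTML::Parser->new(
            api_version => 3,
            start_h     => [
                sub {
                    my ($tagname, $attr) = @_;
                    push @links, $attr->{href}
                        if $tagname eq 'a' && defined $attr->{href};
                },
                'tagname, attr'
            ],
        );
        $p->parse($html);
        $p->eof;
        return @links;
    }
    ```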

    "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

      Ciao, I don't think so. For example, why can DogPile download 200 results from 4 different search engines in 1 second, while my P4 1.7 GHz, 512 MB RAM, 100 Mbit (totally idle) machine downloads Google's top 100 in 1.5 seconds?


      Hardware matters in a high-load environment...

        Obviously, dogPile doesn't submit a search to each of the search engines every time you enter something into dogPile. For one, that would be very foolish for speed (as your problem is showing). Surely dogPile saves the results it fetches from the engines and reuses them the next time someone else queries for the same search.

        So for example, you search for 'hello world'. dogPile sees that these search terms haven't been fetched before, so dogPile queries all the search engines. Next time someone searches for 'hello world', dogPile doesn't need to refetch the search results since it cached them on its own servers.
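        The caching idea above can be sketched in a few lines. The `fetch_from_engines` sub and the one-hour lifetime are assumptions for illustration; a real service would cache on disk or in a shared store, not in one process's memory.

        ```perl
        use strict;
        use warnings;

        my %cache;          # query string => { when => epoch, results => ... }
        my $ttl = 3600;     # assumed cache lifetime: one hour

        # Return cached results when they are still fresh; otherwise fetch
        # from the engines (hypothetical sub) and remember the answer.
        sub cached_search {
            my ($query) = @_;
            my $hit = $cache{$query};
            return $hit->{results} if $hit && time - $hit->{when} < $ttl;
            my $results = fetch_from_engines($query);
            $cache{$query} = { when => time, results => $results };
            return $results;
        }
        ```

        The first 'hello world' query pays the full fetch cost; every later one within the hour is a hash lookup.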

Re: ...How to parse search engine results fast?
by inman (Curate) on Feb 03, 2005 at 18:04 UTC
    The following example gets information from three sources: Google, MSN and Yahoo!. You would need to create a custom parser for each engine. You may wish to look at HTML::Parser for this.

    #! /usr/bin/perl -w
    use strict;
    use warnings;
    use LWP;
    use threads;
    use Thread::Queue;

    my $query = "perl";
    my $dataQueue = Thread::Queue->new;
    my $threadCount = 0;

    while (<DATA>) {
        chomp;
        s/^\s+//;
        s/\s+$//;
        my ($engine, $url) = split /\s+/;
        next unless $url;
        $url .= $query;
        print "$url\n";
        my $thr = threads->new(\&doSearch, $engine, $url);
        $thr->detach;
        $threadCount++;
    }

    while ($threadCount) {
        my $engine  = $dataQueue->dequeue;
        my $content = $dataQueue->dequeue;
        print "$engine returned: $content\n";
        $threadCount--;
    }
    print "Parse and return remaining content\n";

    sub doSearch {
        my $engine = shift;
        my $url    = shift;
        my $ua = LWP::UserAgent->new;
        $ua->agent('Mozilla/5.0');
        $ua->timeout(10);
        $ua->env_proxy;
        my $response = $ua->get($url);
        if ($response->is_success) {
            $dataQueue->enqueue($engine, $response->content);
        }
        else {
            $dataQueue->enqueue($engine, $response->message);
        }
    }

    __DATA__
    Google  http://www.google.com/search?q=
    Yahoo!  http://search.yahoo.com/search?p=
    MSN     http://beta.search.msn.co.uk/results.aspx?q=
      Do you think that managing 3-4 requests to different search engines with LWP::Parallel would give me some benefit in speed?
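      For comparison, a minimal sketch of the LWP::Parallel::UserAgent approach, following its documented register()/wait() interface (the query and URLs are placeholders). It issues the requests concurrently in one process, without threads; whether it beats the threaded version in practice mostly comes down to network latency, since both overlap the waits.

      ```perl
      use strict;
      use warnings;
      use LWP::Parallel::UserAgent;
      use HTTP::Request;

      my $query = 'perl';
      my @urls = (
          "http://www.google.com/search?q=$query",
          "http://search.yahoo.com/search?p=$query",
      );

      my $pua = LWP::Parallel::UserAgent->new;
      $pua->timeout(10);

      foreach my $url (@urls) {
          # register() only queues the request; nothing is fetched yet.
          if (my $err = $pua->register(HTTP::Request->new(GET => $url))) {
              warn $err->error_as_HTML;
          }
      }

      # wait() runs all queued requests in parallel and blocks until they
      # finish or the overall timeout expires, returning one entry per request.
      my $entries = $pua->wait(15);
      foreach my $key (keys %$entries) {
          my $response = $entries->{$key}->response;
          print $response->request->url, ': ', $response->code, "\n";
      }
      ```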


      V.B.
      If I comment out the print line in
      while ($threadCount) {
          my $engine  = $dataQueue->dequeue;
          my $content = $dataQueue->dequeue;
          #print "$engine returned: $content\n";
          $threadCount--;
      }
      I frequently get the error (warning?) "A thread exited while two threads were running".

      I am a thread newbie and don't know why this is happening, nor how "bad" this is, or if it's bad at all.

      You may want to check back at What is the fastest way to download a bunch of web pages? where BrowserUK does something similar which doesn't give this warning. At least, not yet.
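      One way to avoid that warning (a sketch restructured from inman's code, not his exact method) is to keep the threads joinable instead of detaching them: each worker returns its content, and the main thread join()s every worker before exiting, so no thread can still be running at shutdown. The URLs are placeholders.

      ```perl
      use strict;
      use warnings;
      use threads;
      use LWP::UserAgent;

      # Fetch one engine's page; the thread's return value carries the result.
      sub do_search {
          my ($engine, $url) = @_;
          my $ua = LWP::UserAgent->new(timeout => 10);
          my $response = $ua->get($url);
          return $response->is_success ? $response->content
                                       : $response->message;
      }

      my %engines = (
          'Google' => 'http://www.google.com/search?q=perl',
          'Yahoo!' => 'http://search.yahoo.com/search?p=perl',
      );

      # Start one joinable (not detached) thread per engine ...
      my %workers = map { $_ => threads->new(\&do_search, $_, $engines{$_}) }
                    keys %engines;

      # ... then collect each result with join(), which also makes the main
      # thread wait for every worker before the program exits.
      foreach my $engine (keys %workers) {
          my $content = $workers{$engine}->join;
          print "$engine returned: $content\n";
      }
      ```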

      At any rate, thanks for giving me something to get my fingers dirty with in thread world.

      inman, can you send me your private mail address? Thanks.