A200560 has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm building a metasearch engine based on data mining techniques... but that's not important...

My question is about the performance of scraping search engine results from an HTML response page.

I see that some metasearch engines (Mamma, DogPile, Vivisimo, etc.) present the top 50 results from 3-5 search engines in about 1 second.

With my Perl script I can retrieve the top 100 Google results in about 1.5 seconds, but from only one search engine!

Can somebody (much more skilled in Perl than I am) suggest an advanced technique (parallelism, threads, ...?) to retrieve results from 3-5 search engines very quickly?


Excuse my English (I'm Italian) and my poor Perl skills.

Thanks,

VB

Replies are listed 'Best First'.
Re: ...How to parse search engine results fast?
by Fletch (Bishop) on Feb 03, 2005 at 15:53 UTC

    You might get better results from Google using their API (not to mention that scraping them is against their Terms of Service . . .).

      thanks,


      but my question has an architectural flavor...



      Do you have some idea?

        I have an idea that the best architecture in the world will process 0 results a second if the site won't return any, because the querying IP or network has been blocked for not following the site's rules (and doing that to a major search engine is a sure way to end up on your hosting provider's or ISP's bad side).

Re: ...How to parse search engine results fast?
by hardburn (Abbot) on Feb 03, 2005 at 15:48 UTC

    The other metasearch engines may have bigger hardware and a lot more bandwidth than you do. They might also cache results. Web scraping is a straightforward task, so I doubt they're doing anything inherently faster than you are (except, maybe, using a faster HTML parser).
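    As an illustration of what a dedicated parser looks like, here is a minimal HTML::Parser sketch that just pulls every link out of a page. Matching only anchor tags is an assumption for the example; a real result scraper would match the specific markup each engine wraps its hits in.

    ```perl
    use strict;
    use warnings;
    use HTML::Parser;

    # Collect every href from <a> tags. HTML::Parser fires the start_h
    # callback once per opening tag, passing the tag name and attributes.
    sub extract_links {
        my ($html) = @_;
        my @links;
        my $p = HTML::Parser->new(
            api_version => 3,
            start_h     => [
                sub {
                    my ($tagname, $attr) = @_;
                    push @links, $attr->{href}
                        if $tagname eq 'a' && defined $attr->{href};
                },
                'tagname, attr'
            ],
        );
        $p->parse($html);
        $p->eof;
        return @links;
    }
    ```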

    "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

      Ciao, I don't think so. For example, why can DogPile download 200 results from 4 different search engines in 1 second, while my P4 1.7 GHz, 512 MB RAM, 100 Mbit (totally idle) machine downloads Google's top 100 in 1.5 seconds?


      Hardware matters in a high-load environment...

        Obviously, dogPile doesn't submit a search to each of the search engines every time you enter something into dogPile. For one, that would be very foolish for speed (as your problem is showing). Surely dogPile saves the results it fetches from the engines and reuses them the next time someone else queries for the same search.

        So for example, you search for 'hello world'. dogPile sees that these search terms haven't been fetched before, so dogPile queries all the search engines. Next time someone searches for 'hello world', dogPile doesn't need to refetch the search results since it cached them on its own servers.
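        The caching idea above can be sketched in a few lines. The `fetch_from_engines` sub and the one-hour lifetime are assumptions for illustration; a real service would cache on disk or in a shared store, not in one process's memory.

        ```perl
        use strict;
        use warnings;

        my %cache;          # query string => { when => epoch, results => ... }
        my $ttl = 3600;     # assumed cache lifetime: one hour

        # Return cached results when they are still fresh; otherwise fetch
        # from the engines (hypothetical sub) and remember the answer.
        sub cached_search {
            my ($query) = @_;
            my $hit = $cache{$query};
            return $hit->{results} if $hit && time - $hit->{when} < $ttl;
            my $results = fetch_from_engines($query);
            $cache{$query} = { when => time, results => $results };
            return $results;
        }
        ```

        The first 'hello world' query pays the full fetch cost; every later one within the hour is a hash lookup.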

Re: ...How to parse search engine results fast?
by inman (Curate) on Feb 03, 2005 at 18:04 UTC
    The following example gets information from three sources: Google, MSN and Yahoo!. You would need to create a custom parser for each engine. You may wish to look at HTML::Parser for this.

    #! /usr/bin/perl -w
    use strict;
    use warnings;
    use LWP;
    use threads;
    use Thread::Queue;

    my $query = "perl";
    my $dataQueue = Thread::Queue->new;
    my $threadCount = 0;

    while (<DATA>) {
        chomp;
        s/^\s+//;
        s/\s+$//;
        my ($engine, $url) = split /\s+/;
        next unless $url;
        $url .= $query;
        print "$url\n";
        my $thr = threads->new(\&doSearch, $engine, $url);
        $thr->detach;
        $threadCount++;
    }

    while ($threadCount) {
        my $engine  = $dataQueue->dequeue;
        my $content = $dataQueue->dequeue;
        print "$engine returned: $content\n";
        $threadCount--;
    }
    print "Parse and return remaining content\n";

    sub doSearch {
        my $engine = shift;
        my $url    = shift;
        my $ua = LWP::UserAgent->new;
        $ua->agent('Mozilla/5.0');
        $ua->timeout(10);
        $ua->env_proxy;
        my $response = $ua->get($url);
        if ($response->is_success) {
            $dataQueue->enqueue($engine, $response->content);
        }
        else {
            $dataQueue->enqueue($engine, $response->message);
        }
    }

    __DATA__
    Google  http://www.google.com/search?q=
    Yahoo!  http://search.yahoo.com/search?p=
    MSN     http://beta.search.msn.co.uk/results.aspx?q=
      Do you think that managing 3-4 requests to different search engines with LWP::Parallel would give me some benefit in speed?
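      For comparison, a minimal sketch of the LWP::Parallel::UserAgent approach, following its documented register()/wait() interface (the query and URLs are placeholders). It issues the requests concurrently in one process, without threads; whether it beats the threaded version in practice mostly comes down to network latency, since both overlap the waits.

      ```perl
      use strict;
      use warnings;
      use LWP::Parallel::UserAgent;
      use HTTP::Request;

      my $query = 'perl';
      my @urls = (
          "http://www.google.com/search?q=$query",
          "http://search.yahoo.com/search?p=$query",
      );

      my $pua = LWP::Parallel::UserAgent->new;
      $pua->timeout(10);

      foreach my $url (@urls) {
          # register() only queues the request; nothing is fetched yet.
          if (my $err = $pua->register(HTTP::Request->new(GET => $url))) {
              warn $err->error_as_HTML;
          }
      }

      # wait() runs all queued requests in parallel and blocks until they
      # finish or the overall timeout expires, returning one entry per request.
      my $entries = $pua->wait(15);
      foreach my $key (keys %$entries) {
          my $response = $entries->{$key}->response;
          print $response->request->url, ': ', $response->code, "\n";
      }
      ```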


      V.B.
      If I comment out the print line in
      while ($threadCount) {
          my $engine  = $dataQueue->dequeue;
          my $content = $dataQueue->dequeue;
          #print "$engine returned: $content\n";
          $threadCount--;
      }
      I frequently get the error (warning?) "A thread exited while two threads were running".

      I am a thread newbie and don't know why this is happening, nor how "bad" this is, or if it's bad at all.

      You may want to check back at What is the fastest way to download a bunch of web pages? where BrowserUK does something similar which doesn't give this warning. At least, not yet.
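      One way to avoid that warning (a sketch restructured from inman's code, not his exact method) is to keep the threads joinable instead of detaching them: each worker returns its content, and the main thread join()s every worker before exiting, so no thread can still be running at shutdown. The URLs are placeholders.

      ```perl
      use strict;
      use warnings;
      use threads;
      use LWP::UserAgent;

      # Fetch one engine's page; the thread's return value carries the result.
      sub do_search {
          my ($engine, $url) = @_;
          my $ua = LWP::UserAgent->new(timeout => 10);
          my $response = $ua->get($url);
          return $response->is_success ? $response->content
                                       : $response->message;
      }

      my %engines = (
          'Google' => 'http://www.google.com/search?q=perl',
          'Yahoo!' => 'http://search.yahoo.com/search?p=perl',
      );

      # Start one joinable (not detached) thread per engine ...
      my %workers = map { $_ => threads->new(\&do_search, $_, $engines{$_}) }
                    keys %engines;

      # ... then collect each result with join(), which also makes the main
      # thread wait for every worker before the program exits.
      foreach my $engine (keys %workers) {
          my $content = $workers{$engine}->join;
          print "$engine returned: $content\n";
      }
      ```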

      At any rate, thanks for giving me something to get my fingers dirty with in thread world.

      inman, can you send me your private mail address? Thanks.