I was happy to see our favorite language mentioned in a recent article disputing claims that Yahoo!'s index was apparently now double the size of Google's. Better yet, they provided the code used to run the test. I didn't expect rocket science; they were simply running random queries against the two engines (basically what many scripts do to find a googlewhack).

I have to say that although the code apparently worked properly, I was not all that impressed by it. Perhaps the author was not a native perl coder? I noticed a lot more duplication than I expected, what I assume are leftover idioms from earlier perl days (srand calls), and some evil if-statement logic that I can't explain away!

Either way, since this article is making the rounds, I thought some of my fellow monasterians might like to comment on the code.

ps - Sorry if this is posted in the wrong place. It seemed a toss-up to me between here and Perl News.

pps - Just noticed that there is another thread about this in CUFP. Sorry for the dupe.

Moved from Meditations to Perl News by Arunbear.

Re: NCSA Uses Perl to Compare Google/Yahoo
by creamygoodness (Curate) on Aug 16, 2005 at 15:52 UTC
    Interesting stuff! The project directly violates the prominently placed "No Automated Querying" directive in Google's terms of service (http://www.google.com/intl/en/terms_of_service.html), and it seems unlikely that they made an exception:
    Please do not write to Google to request permission to "meta-search" Google for a research project, as such requests will not be granted.
    It also looks like the spider doesn't sleep between requests. Presumably the author made a considered choice that because Google and Yahoo are so robust, it was acceptable to fire off a huge number of requests all at once, in contravention of standard spidering netiquette (Google and Yahoo don't do that to your server). If you ever write a spider, please don't do that.
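    For anyone wondering what "sleeping between requests" might look like, here is a minimal sketch. It is not taken from the code in the article. The helper name polite_each, the 0.2 second delay, and the stand-in callback are all assumptions made for illustration; a real spider would put its HTTP request in the callback and use a far longer delay.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep time);   # fractional-second sleep and time

# Hypothetical helper: call $fetch->($item) once per item, pausing
# $delay seconds between consecutive calls (no pause before the first).
sub polite_each {
    my ($delay, $fetch, @items) = @_;
    my @results;
    for my $i (0 .. $#items) {
        sleep $delay if $i > 0;
        push @results, $fetch->($items[$i]);
    }
    return @results;
}

# Stand-in for a real request: records when it was called and
# upper-cases its argument so there is something to return.
my @stamps;
my @out = polite_each(
    0.2,
    sub { push @stamps, time; return uc $_[0] },
    qw(foo bar baz),
);
print "@out\n";
```

    Keeping the delay inside the loop helper, rather than sprinkling sleep calls through the fetching code, means the pause cannot be forgotten when a new kind of request is added.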
    Perhaps the author was not a native perl coder?

    sub main? No hashes? Lots of subscripted array elements? Could it be... C?

    --
    Marvin Humphrey
    Rectangular Research ― http://www.rectangular.com