Google has issued a programming contest to create either an application or a system that can improve their services.
A prize of $10,000 plus a VIP visit to Google HQ is at stake.
The contest does not mention Perl, but OTOH it does not forbid it either. Although the ultimate goal should be to improve efficiency, they keep an open attitude regarding how such a benefit can be achieved (making a better engine, changing data structures, compressing in a different way and so on).
Contestants can download sample data and programs (57 MB) that should be interesting to look at. The download includes data from 16000 web pages and a link to a site where you can download the full set if you feel like it.
I think that some monks might consider the contest worth a glimpse. At the very least, you can get an idea of what is under the hood.

Update: Somebody has now confirmed that Perl is not accepted in the contest (thanks, chaoticset), but anyway it was fun looking at the code! :)
_ _ _ _ (_|| | |(_|>< _|

Re: (OT) Google programming contest
by Masem (Monsignor) on Feb 07, 2002 at 14:32 UTC
    I looked at those rules yesterday, and unfortunately, I think it does restrict solutions to either be C++ or Java. I quote from the rules:
    Your submission must include a Makefile and README, and must compile on Linux 2.2 or 2.4 using g++ (for C++ code) or standard Sun tools (for Java code).
    While it does go on to say that the code cannot rely on anything else that isn't open source or GPL'd, I believe they want only compiled programs and not interpreted ones.

    However, that's my reading; there may be more clarifications.

    -----------------------------------------------------
    Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
    "I can see my house from here!"
    It's not what you know, but knowing how to find it if you don't know that's important

      I read it as permitting Perl. You have to link to their C++ code (easy enough with Perl) and provide the download and install instructions for any third-party open source tools (Perl!).

      -- Randal L. Schwartz, Perl hacker

      Here's the bad news.

      As pudge said on use.perl;, "Wow. That's retarded."

      -----------------------
      You are what you think.

        It's sad, but I don't think the decision is retarded. I am sure Google staff use Perl day in, day out, and the decision to not include it was probably bitterly fought.

        Now consider a section from the Google page in question, and I quote:

        Your mission is to write a program (most likely by adding code to the ripper) that does something interesting with the data, in such a way that it would scale to a web-sized collection of documents

        As much as it would be fun to hack something up with Perl to munge a 57 MB sample set, I personally wouldn't want to have to wait for the run results on a web-sized collection. If 16 000 pages results in 57 MB, then Google's current collection of 2 073 000 000 pages would mean a data set of roughly 7 380 000 MB, that's 7 terabytes! That's several orders of magnitude for you to shoot yourself in the foot with if you miss a whisker of performance.
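
        For anyone who wants to check that arithmetic, here is a rough sketch in Python. The page counts and the 57 MB figure are taken from the post above; the assumption that size scales linearly with page count, and the use of decimal units (1 TB = 1,000,000 MB), are mine.

```python
# Back-of-the-envelope scaling of the sample data to a web-sized collection.
# Figures come from the post above. Assumptions: size grows linearly with
# page count, and decimal units are used (1 TB = 1,000,000 MB).

SAMPLE_PAGES = 16_000          # pages in the sample download
SAMPLE_MB = 57                 # size of the sample download in MB
TOTAL_PAGES = 2_073_000_000    # pages Google reports indexing


def estimate_total_mb(sample_pages, sample_mb, total_pages):
    """Scale the sample size linearly by page count."""
    return sample_mb * total_pages / sample_pages


total_mb = estimate_total_mb(SAMPLE_PAGES, SAMPLE_MB, TOTAL_PAGES)
total_tb = total_mb / 1_000_000

print(f"{total_mb:,.0f} MB, about {total_tb:.1f} TB")
```

        The exact product comes out slightly above the rounded figure quoted above, but the conclusion is the same: the full collection is in the 7 terabyte range.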

        At this end of the spectrum, you have to pay careful attention to details, and I think Perl would just generate too much overhead to scale up.

        But then again, wouldn't Java? If you're not careful you could drown in an ocean of objects, madly being garbage collected. Think what that would do to your performance. (I recall an article (/.? K5?) that discussed the use of Lisp in planning air trips on the Sabre system (or an analog). They have to pay careful attention to how they code in order not to produce garbage -- too bad I can't find the link.)

        It's clear that Perl would have allowed some really nifty prototypes coded with a minimum of fuss, that the Google crew could have picked up and run with, recasting them in C++, if that is indeed the whole point of the contest. And that is what sucks.

        Hey, I'm sure Python and Ruby could too.

        But instead they chose Java. In that case one can only conclude that they forgot Public Enemy's number one rule, "Don't believe the hype."

        --
        g r i n d e r
        print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u';
How Does Google Search Work? - Re: (OT) Google programming contest
by metadoktor (Hermit) on Feb 08, 2002 at 00:00 UTC
    Actually, the program source code and a sample dataset only come in at under 300k. It's the full-blown dataset that comes in at 57MB, I believe, and supposedly they can mail that to you sometime during February on five CDs. I downloaded the small gzipped tar file with the source code. It has a few C++ programs in it. Things seem to be object-oriented, but I didn't look at it for very long.

    The sample projects that they list on the page don't seem to be very interesting though.

    I'm looking forward to the much touted new feature called Recency. At least in the near future you'll know how old the link is.

    As for improvements to Google itself, I think the only obvious thing is to continue to improve their search engine. I still have trouble with searching for something and getting a googol of hits. I don't want a googol of hits. I only want a few that match my specific query. Sometimes I resort to jumping to the 12th page of hits, because I used to find on other search engines like HotBot or AltaVista that that was where my true hits were located. They weren't at the top; they were hidden underneath a lot of other garbage. Putting in more words with Google doesn't work too well either, because then it comes back and says that it didn't find anything matching your criteria.

    I wonder how they search their pages. I've always assumed that search engines use some sort of pre-search: they scan their pages for common words other than "the", "a", "of", "or" and "and" (to name a few articles/conjunctions) and index them in some manner, so that the word "Banana" might refer to 12,330 web pages.
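
    The kind of pre-search guessed at above is usually called an inverted index. Here is a minimal sketch in Python of that idea, not a description of how Google actually does it. The stop-word list, the sample pages and the function names are all made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list, taken from the examples in the post above.
STOP_WORDS = {"the", "a", "of", "or", "and"}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each non-stop word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            if word not in STOP_WORDS:
                index[word].add(page_id)
    return index


def search(index, query):
    """Return the ids of pages containing every non-stop word in the query."""
    words = [w for w in tokenize(query) if w not in STOP_WORDS]
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Made-up sample pages, keyed by page id.
pages = {
    1: "The banana is a fruit",
    2: "Banana bread and banana cake",
    3: "A history of the apple",
}

index = build_index(pages)
print(sorted(index["banana"]))
print(sorted(search(index, "banana bread")))
```

    Because this sketch requires every query word to be present, each extra word can only shrink the result set, which would be one plausible explanation for getting no hits at all after adding more words.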

    Here is an excellent article by Paul Boutin @ Webmonkey.com that sheds some light on how Google works. Btw, I seem to recall that he was invited by Google to tour their facility sometime after this article was published.

    Update:

    I thought of an interesting application upgrade for Google, that is, making a better pdf2text or pdf2html converter. The ones they have now suck although they are better than nothing.

    metadoktor

    "The doktor is in."