Actually, the program source code and a sample dataset together come in at under 300k. It's the full-blown dataset that comes in at 57MB, I believe, and supposedly they can mail that to you sometime during February on five CDs. I downloaded the small source code GZIP'd tar file. It has a few C++ programs in it. Things seem to be object-oriented, but I didn't look at it for very long.

The sample projects that they list on the page don't seem to be very interesting though.

I'm looking forward to the much-touted new feature called Recency. At least in the near future you'll know how old a link is.

As for improvements to Google itself, I think the only obvious thing is to continue to improve their search engine. I still have trouble with searching for something and getting a googol of hits. I don't want a googol of hits. I only want a few that match my specific query. Sometimes I resort to jumping to the 12th page of hits, because I used to find on other search engines like HotBot or Altavista that that was where my true hits were located. They weren't at the top; they were hidden underneath a lot of other garbage. Putting in more words with Google doesn't work too well either, because then it comes back and says that it didn't find anything matching your criteria.

I wonder how they search their pages. I've always assumed that search engines do some sort of pre-search: they scan their pages ahead of time for words other than the very common ones (the, a, of, or, and, to name a few articles and conjunctions) and index them in some manner, so that the word "Banana" might refer to 12,330 web pages.
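To make that guess concrete, here is a minimal sketch in Python of the kind of index I have in mind. It is only my assumption about the general idea, not how Google actually does it. The stop-word list, the sample pages, and the function names (tokenize, build_index, search) are all made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; a real engine would use a much longer one.
STOP_WORDS = {"the", "a", "of", "or", "and", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each non-stop word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            if word not in STOP_WORDS:
                index[word].add(page_id)
    return index

def search(index, query):
    """Return ids of pages containing every non-stop word in the query."""
    words = [w for w in tokenize(query) if w not in STOP_WORDS]
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Made-up sample pages, for illustration only.
pages = {
    1: "The banana is a fruit",
    2: "Banana bread and banana pudding",
    3: "A guide to Perl programming",
}
index = build_index(pages)
print(sorted(search(index, "banana")))        # [1, 2]
print(sorted(search(index, "banana bread")))  # [2]
```

Because this sketch requires every query word to be present on a page, each extra word can only shrink the result set, which would be one plausible reason why piling on more words ends with nothing found.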

Here is an excellent article by Paul Boutin @ Webmonkey.com that sheds some light on how Google works. Btw, I seem to recall that he was invited by Google to tour their facility sometime after this article was published.

Update:

I thought of an interesting application upgrade for Google: making a better pdf2text or pdf2html converter. The ones they have now suck, although they are better than nothing.

metadoktor

"The doktor is in."


In reply to How Does Google Search Work? - Re: (OT) Google programming contest by metadoktor
in thread (OT) Google programming contest by gmax
