I'll give you some general comments. I'm speaking in vague generalities because I don't want to overshare any technical details, but I worked at Blekko, a search engine that was built from the ground up in Perl.

First, you will need a fast, high-capacity datastore. Probably not an SQL database; you'll want something very fast, very expandable, and very dependable, with a lot of internal redundancy. Look into the NoSQL key-value datastores, or you may decide you need to write one. Blekko wrote one.
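To make "internal redundancy" a bit more concrete, here is a toy sketch in Python of a key-value store that writes every key to several nodes, so a read still succeeds when one of them is down. This is emphatically not Blekko's design (that stays proprietary); the class name `ReplicatedStore`, the node count, and the replica count are all made up for illustration, and a real store would add persistence, networking, and repair of failed replicas.

```python
import hashlib


class ReplicatedStore:
    """Toy in-memory key-value store: each key is written to several nodes."""

    def __init__(self, node_count=5, replicas=3):
        if not 1 <= replicas <= node_count:
            raise ValueError("replicas must be between 1 and node_count")
        # Each "node" is just a dict here; None marks a node that has failed.
        self.nodes = [dict() for _ in range(node_count)]
        self.replicas = replicas

    def _owners(self, key):
        # Hash the key to pick a starting node, then take the next
        # replicas-1 nodes (wrapping around) as the additional copies.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        first = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return [(first + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        # Write to every owner that is still alive; dead nodes are skipped.
        for n in self._owners(key):
            if self.nodes[n] is not None:
                self.nodes[n][key] = value

    def get(self, key):
        # Read from the first live owner that holds the key.
        for n in self._owners(key):
            node = self.nodes[n]
            if node is not None and key in node:
                return node[key]
        raise KeyError(key)

    def fail(self, n):
        # Simulate losing a machine and everything stored on it.
        self.nodes[n] = None
```

With three replicas, a key survives the loss of any two of its owner nodes: `put` a page, call `fail` on one of its owners, and `get` still returns it.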

Second, you will want fast indexing and a way to quickly run queries across the whole of your database; something like Hadoop or BigTable. Blekko wrote one.
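For a sense of what "indexing" means at the smallest possible scale, here is a minimal inverted-index sketch in Python. It is an assumption-laden toy, not what Blekko built: `build_index` and `search` are hypothetical names, tokenizing is just lowercasing and splitting on whitespace, and everything lives in one process's memory. The hard part alluded to above is doing the same thing across hundreds of machines.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start with the postings for the first term, then intersect the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Given `docs = {1: "perl search engine", 2: "perl crawler"}`, calling `search(build_index(docs), "perl search")` returns only document 1, since it is the only one containing both terms.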

I'm leaving out the vast majority of details here because they're proprietary information, but the summary is that you'll need a big (hundreds of machines), fast datastore to store your crawl and index, and a good mechanism to access it quickly. Blekko wrote all these.

It's taken Blekko 4 years to get to where they are now (which is "pretty darn good, better than Google some places, not as good others") with about 20 people (though they started with about 7 or 8). You're in for a long haul, and your backers will need to be patient. Writing a search engine is not easy, and it will go better if you have folks onboard who have already worked on one for a while.

Crawlers are easy; search engines are hard.


In reply to Re: internet search engine by pemungkah
in thread internet search engine by olowodara
