PerlMonks  

When you say "functionality similar to that of google.com", what exactly do you mean? When I hear that, I assume you need something with similar search result quality and depth of indexing. If that is the case, stop right now and go buy the darn thing from Google. You won't get there on your own.

As a reference point, I once built a search engine combining Apache/mod_perl, MySQL and Glimpse. Working alone, it took me around 4 months to complete. It indexed all of the Open Directory Project and served most queries in under a second running on a PII/600. The search result format was actually more complicated than Google's: it included the category hierarchy and had advanced tree-limiting features.

The project was generally successful. However, it never came close to providing something comparable to Google. Why not? The search results sucked, to put it mildly. All it did was a simple partial-word match. Glimpse supported more, but the more advanced features were too slow to use. Also, the indexing was really, really slow. It would never scale to indexing the entire Internet, no matter how much hardware you put behind it. As it was, it took around 6 hours to index the Open Directory database (although much of that was spent on character-set translation).
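
To make the partial-word point concrete, here is a rough sketch of what that kind of matching amounts to. It is written in Python for brevity and is not the original mod_perl/Glimpse code; the function name `partial_word_search` and the sample documents are made up for illustration.

```python
# Hypothetical illustration only: not the original mod_perl/Glimpse code.
def partial_word_search(query, documents):
    """Return titles of documents whose text contains the query as a substring.

    Results come back in input order. There is no ranking, no stemming and
    no notion of word boundaries, which is why result quality is poor.
    """
    needle = query.lower()
    return [title for title, text in documents if needle in text.lower()]


# Made-up sample data: (title, text) pairs.
docs = [
    ("Cat care", "How to care for your cat"),
    ("Catalog software", "Tools for building a product catalog"),
    ("Concatenation", "String concatenation in Perl"),
]

hits = partial_word_search("cat", docs)
print(hits)
```

A query for "cat" matches all three entries, because "catalog" and "concatenation" both contain the substring, and nothing pushes the page that is actually about cats to the top.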

So, in short, be very careful about what you attempt here. If you need Google, buy Google (or one of the competitors like Verity, etc.). If you can make do with much less, then you might build it yourself. But have no illusions about what you'll end up with.

-sam


In reply to Re: Perl Search Applicance by samtregar
in thread Perl Search Applicance by PyroX
