Re: Perl Search Applicance

by samtregar (Abbot)
on Jun 20, 2002 at 05:00 UTC (#175911=note)

in reply to Perl Search Applicance

When you say "functionality similar to that of", what do you mean exactly? When I hear that, I'm guessing you mean that you need something of similar search result quality and depth of indexing. If that is the case, stop right now and go buy the darn thing from Google. You won't get there on your own.

As a reference point, I once built a search engine combining Apache/mod_perl, MySQL and Glimpse. It took me around 4 months to complete, working alone. It indexed all of the Open Directory Project and served most queries in under a second running on a PII/600. The search result format was actually more complicated than Google's: it included the category hierarchy and had advanced tree-limiting features.

The project was generally successful. However, it never came close to providing something comparable to Google. Why not? The search results sucked, to put it mildly. All it did was a simple partial-word match. Glimpse supported more, but the more advanced features were too slow to use. Also, the indexing was really, really slow. It would never scale to indexing the entire Internet, no matter how much hardware you put behind it. As it was, it took around 6 hours to index the Open Directory database (although much of that was in character-set translation).
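
To make "simple partial-word match" concrete, here is a minimal sketch. It is not the code from that project (which used Glimpse behind mod_perl); it is written in Python for brevity, and the function name and sample documents are made up for illustration. A document matches when every query term appears as a substring of some word in it, and nothing is ranked.

```python
def partial_word_search(docs, query):
    """Return sorted ids of docs in which every query term occurs
    as a substring of at least one word. No ranking is attempted."""
    terms = query.lower().split()
    if not terms:
        return []  # an empty query matches nothing
    hits = []
    for doc_id, text in docs.items():
        words = text.lower().split()
        # Each term must be contained in some word of the document.
        if all(any(term in word for word in words) for term in terms):
            hits.append(doc_id)
    return sorted(hits)


# Hypothetical sample data, keyed by a made-up category path.
docs = {
    "perl/modules": "Perl modules for searching text",
    "python/tools": "Python research tools",
    "cooking": "Recipes for pears",
}

print(partial_word_search(docs, "search"))
```

Note that "search" matches both "searching" and "research", and the hits come back in no meaningful order. That is roughly why results from this kind of matching feel poor next to a ranked engine.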

So, in short, be very careful about what you attempt here. If you need Google, buy Google (or one of the competitors like Verity, etc.). If you can make do with much less then you might build it yourself. But have no illusions about what you'll end up with.


Replies are listed 'Best First'.
Re: Re: Perl Search Applicance
by PyroX (Pilgrim) on Jun 20, 2002 at 05:08 UTC
    I mean, the good results, and the ability to classify results ratings and translate some pages, as well as provide search stats and so on.

    Alas, I do not have $80,000 to buy Google hardware. (I wish I did)

    This is not going to map the internet; it will be used on about 120,000 internal pages, on a couple of domains. (Small pages, under 4k mostly)

    If you could point to some of your code as examples, I would appreciate it ;)
      If you don't have $80,000 to buy Google's solution then the chances are very good you don't have the money to develop it either. So the question becomes, can you live with what you can build?

      First off, you want to be able to "translate some pages". That's a pretty tall order. Are you at least planning to buy this piece or are you going to build this too?

      Second, you need to get a clearer idea of how good the search results need to be. Google has the best algorithms in the business and they aren't publishing them! People have a pretty good idea how they do it, but replicating it will take a lot of hard work and more than a few brainiacs in the barn. I recommend you look at Glimpse to see what some really smart people have been able to do with quite a lot of time. Google it ain't, but it's not bad either. Maybe you can use it as a backend component the way I did.

      Unfortunately the search engine I built went straight to /dev/null along with the company that paid for it. They never sold a single copy to my knowledge. Wasn't the 90s fun?


        Thanks for your advice, I will definitely take a look at Glimpse.

        I am taking on this challenge, if I have to re-invent the wheel I will, and I am not easy to satisfy.
Re: Re: Perl Search Applicance
by johnseq (Initiate) on Jun 21, 2002 at 01:45 UTC
    I think you'll find Harvest to provide the web crawling, indexing and the web search UI you're looking for.

    It's very scalable (can be clustered for both search and crawl, IIRC), is pretty mature, and is actively developed.
