|Just another Perl shrine|
Hi. A well-written engine and a well-compiled database will make a Perl-only engine look pretty good, but you may want to use a C-based engine for all of it, or just for the backend, if you need sheer power. It depends on your speed requirements and how structured the data is. Do you need multiple cursors so that many people can search concurrently? Does the database have to be updatable while being searched?
120K pages @ 4KB/page = 480MB. Maybe cut off 100MB or more if it is HTML. This is orders of magnitude below Google scale; rather, it is medium-sized for htdig. I have a mod_perl and htdig system running on a gigabyte of data from 60 websites, and it is running without a hitch. It is the engine on www.omron.com. The hardware is a 2-year-old $6K, 5U RedHat box with 5 RAID disks and a hot standby, which currently provides way more than enough power, plus a smaller backup box. Downloading and indexing sequentially takes 17 hours, but the indexing itself is quite fast. I spent a long time making it look not cheesy, which worked at least partly :) It can do fuzzy searching (maybe turned off now) and stems words with morphological analysis so you hit plurals, gerunds, etc. I also built a Perl-based administration section, which proved useful, and they had me add another search site for their corporate news. You can maybe do these things too.
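To redo the size estimate above with your own numbers, here is a minimal sketch in Python (chosen only for brevity; the same few lines port directly to Perl). The function name `corpus_size_mb` and the `markup_mb` parameter are made up for illustration, and it uses the same rough convention as the figure above, 1MB = 1000KB.

```python
def corpus_size_mb(pages, avg_kb_per_page, markup_mb=0):
    """Rough corpus size in MB (1 MB = 1000 KB), minus an optional
    guess at how much of the total is HTML markup rather than text."""
    return pages * avg_kb_per_page / 1000 - markup_mb

if __name__ == "__main__":
    # 120K pages at 4KB each, before and after discounting ~100MB of markup
    print(corpus_size_mb(120_000, 4))
    print(corpus_size_mb(120_000, 4, markup_mb=100))
```

The markup discount is a guess, not a measurement; strip the tags from a sample of real pages to get a better number.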
As for your files, that many 4K-size pages tells me either you have an awful lot of poetry in your collection, you run a translation company, or the data is probably well-structured. Are you sure you can't get this data into a database and chop it up more? What does it look like and where did the data come from? Some analysis might even make a big hash perform well; memory's cheap. But real text searching involves text analysis (parts of speech, soundex, etc.), some kind of query analysis, and an inverted index, plus maybe caching and threading. Berkeley DB is also a good thing here. This is a significant amount of work even if there are many bits of C and Perl code that can help.
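To give a feel for the "soundex" part of text analysis mentioned above, here is a small sketch of the classic American Soundex code in Python (a Perl version would look much the same). It is an illustration only, not what htdig or any other engine named here actually runs; the function name `soundex` is just a label for this example.

```python
def soundex(word):
    """Classic 4-character American Soundex code, e.g. Robert -> R163."""
    codes = {}
    for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in group:
            codes[ch] = digit

    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0].upper()
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue            # h and w do not separate letters with equal codes
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code             # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]

if __name__ == "__main__":
    for name in ("Robert", "Rupert", "Ashcraft", "Tymczak"):
        print(name, soundex(name))
```

Words that sound alike get the same code, so indexing the code next to the word lets a misspelled query still find its target, at the price of more false hits.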
There is also the matter of how fast this system has to react, e.g. under 1 second. So building this could cost $30-80K of your time. Making something which performs well in the real world is more than a "simple" job. Also, information indexing is one of the interesting research fields out there, not simplistic at all, and you get out exactly what you put into it. If you want some code I could sell you something that works, but it sounds like you want to build it all yourself. In that case have fun! I think you've already got a lot of hints; you have to do some homework on your own and then maybe come back here. I'd definitely recommend trying to code an inverted index for your own fun, to see how it handles, say, 100MB of data.
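If you want a starting point for that inverted index experiment, here is a bare-bones sketch in Python (it maps almost line for line onto a Perl hash of hashes). The names `tokenize`, `build_index` and `search` are made up for this example. It keeps everything in memory and does plain AND matching, with no stemming, ranking or phrase search, so treat it as a toy for learning the idea rather than a design for 480MB.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing every query term (AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return hits

if __name__ == "__main__":
    docs = {
        "a.html": "Perl search engines and inverted indexes",
        "b.html": "A C backend with a Perl front end",
        "c.html": "Poetry, mostly",
    }
    index = build_index(docs)
    print(sorted(search(index, "perl")))
    print(sorted(search(index, "perl backend")))
```

Once this works on a few files, point it at your 100MB sample and watch memory use and build time. That is where on-disk storage such as Berkeley DB starts to look attractive.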
As for other available software, you could consider glimpse, wais, Google, IIS Indexing Server, and even MySQL's SQL searching (haven't tried it on large data). There is also Ultraseek, but having installed that once (mitsubishi.com) I do not recommend it unless your job is short and sweet and you have bucks. The license cost grows with document (or was it URL) number, there was no developer documentation, and I found myself having to decipher cryptic Python code embedded in HTML a la ASP to do even a minimum of customization. But it works too.
Translation? Please explain this more. A translation server is not cheap, though maybe you want SYSTRAN? Or are you going to be doing this by hand, or throwing it at the fish?
Thanks for your response.
In that case I can recommend breaking down the search problem into pieces if possible to homogenize your data. If you have small, similar pages, they may be easier to handle with one routine, for example. You may be interested in htdig.org for a straightforward indexer with many little bells and whistles, and you can get into the source code yourself. (I confess I had to remove a robot rule at their request. =8| ) But htdig is tough to configure and doesn't provide phrase searching. It is not a Google beater but maybe a little like AltaVista.
There is also namazu.org, made mainly for the Japanese language and not as well documented in English. It is also not as fast, but it's useful and popular in Japan. Incidentally, namazu's indexer is Perl and the search is in C, which may be of interest to you.
There are a number of simple, weak Perl search programs out there, but I totally do not vouch for them, and one or another probably has a known security hole. I saw some at cgi.resourcindex.com. But there is a big difference between indexing 5MB and 500MB.
There is an old Wired article I found once at Webmonkey about a simple inverted index program that may help illustrate some concepts. But there are a number of C/C++ engines developed over the past 10 years with Perl interfaces. Some people have mentioned glimpse, though someone on the htdig mailing list said their glimpse crashed at 150K documents, 2 years ago. Also, swish-e is used for searching CPAN, for example, and has a Perl API. Maybe you should look at searchtools.com, which has links to Perl search programs and many others.
Finally, to be fair, I should mention JuggernautSearch, which has drawn some flaming on the htdig mailing list. It is listed in the Perl section of searchtools.com and mentions indexing a large number of documents, though it is relatively simple in its indexing. So while I have not tried it, it seems possible that a Perl-only search (which I think Juggernaut is?) could index 150,000 pages in real time. Anybody else know about this?