punkish has asked for the wisdom of the Perl Monks concerning the following question:

Yesterday I decided to "graduate" my application's home-grown search mechanism (which worked quite well, mind you, but in a limited way since it required a SQL database to work, and wouldn't do fuzzy searches) to something more "institutional". Searching around in the CPAN-space, I found Plucene and Kinosearch, both Perlized versions of Lucene. Since I had more awareness of Plucene, and since its version number is higher than Kino ;-), I decided to give it a whirl.

Likely I am being dense, but I found Plucene's documentation very sparse. In fact, in the beginning I didn't have any idea of even the first steps. There was a suggestion in the docs to read a couple of "onjava" articles on Lucene given that Plucene mirrored Lucene so well. I read those articles, but since I understand zip about Java, that didn't get me anywhere.

Then I found Plucene::Simple. That got my first steps going, but then I was stuck again. Then I found a circa 2004 article by Simon Cozens on perl.com which actually gave step-by-step instructions on how to implement a web-based search. That was nice, but it was surprising that finding it took so much digging around.

The article referenced above required downloading other modules (Text::Context and family, among others) in order to make the search results nicer looking. After all this, I still don't know how to do several other things that my home-grown solution did, such as calculating relevance scores and ordering results by them.

I have some experience with implementing Swish-e and have found it easier to work with than Plucene. There is also ht://Dig... maybe other mechanisms as well.

  1. What do other monks implement for searching?
  2. How do Plucene and Kinosearch compare?
  3. What is their relative longevity?
  4. Is there a canonical "Perl way" of searching through websites?
--

when small people start casting long shadows, it is time to go to bed

Re: looking for a good Perl-way for implementing website search
by perrin (Chancellor) on May 09, 2006 at 15:01 UTC
    Swish-e is great and has an easy Perl API. If you already know it, I don't see any reason to use something else.
      "I don't see any reason to use something else"
      I know Swish-e, and yes, it is great. It is, however, not "the simplest thing that could work." Or, is it?

      Since it is a C program, I can't implement it on a web host unless they have it installed there (although my new host -- http://www.dreamhost.com -- does seem to allow compiling one's own programs).

      --

      when small people start casting long shadows, it is time to go to bed
        It's almost certainly simpler than any of the others you mentioned. They are more like toolkits for building a custom search engine. Swish-e is ready-to-use and complete.
Re: looking for a good Perl-way for implementing website search
by zentara (Cardinal) on May 09, 2006 at 16:20 UTC
Re: looking for a good Perl-way for implementing website search
by blazar (Canon) on May 09, 2006 at 14:52 UTC

    Wild guess: http://nms-cgi.sf.net/ may help. You may be interested in checking how they implemented their "simple web site search engine".

      The nms search program is a reimplementation of Matt Wright's program and we don't call it simple search for nothing. If you're considering things like Plucene, then simple search will almost certainly not be powerful enough for you.

      --
      <http://dave.org.uk>

      "The first rule of Perl club is you do not talk about Perl club."
      -- Chip Salzenberg

      The NMS Simple search is constrained by the requirement that it must be installable on an average shared hosting account with no additional modules or shell access and that it can be dropped in as a direct replacement for the MSA Search. Basically it greps the content of the files every time it makes a search - this obviously isn't the ideal way to do it. There is a TODO item to implement a search that doesn't need to be compatible with the MSA one but, y'know, time ....

      /J\
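
The "grep every file on every search" approach described above can be sketched roughly as follows. This is only an illustration, not the NMS code: the `grep_search` name, the `.html`/`.htm` filter, and the all-terms-must-match rule are assumptions made for the example. It uses only core modules, in keeping with the shared-hosting constraint.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return the paths of .html/.htm files under $root
# whose contents contain every term (case-insensitive substring match).
# Nothing is indexed, so every call rereads every file.
sub grep_search {
    my ($root, @terms) = @_;
    return () unless @terms;

    my @hits;
    find(sub {
        # Inside the callback $_ is the bare file name; File::Find has
        # already changed into the file's directory.
        return unless -f $_ && /\.html?$/i;
        open my $fh, '<', $_ or return;
        my $text = do { local $/; <$fh> };   # slurp the whole file
        close $fh;
        return unless defined $text;
        $text = lc $text;

        # Keep the file only if no term is missing from its text.
        push @hits, $File::Find::name
            unless grep { index($text, lc $_) < 0 } @terms;
    }, $root);

    return sort @hits;
}
```

The cost grows with the total size of the site on each query, which is why this is workable for small sites but not the ideal design.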

      My original home-grown solution was actually much better than nms simple search. I had an inverted-index table of words, a table of stop words, and a master dictionary table (essentially SQL-constructed from the standard dict that comes with Unix) for avoiding duplication. As I said, it worked very well except for stemming and fuzzy searches, but it required a SQL db. Switching to nms would be going backward.

      I do have Plucene implemented now as an experimental mechanism. I would like to hear how it compares to Kinosearch. Additionally, as I said in my OP, Plucene is pretty sparsely documented. It took a lot of digging around, and I still don't know, for example, how to score relevancy. Surely Plucene can't be the canonical Perl website search mechanism if it is so (IMO) sparsely documented and the last update was almost a year ago. Kinosearch logs show some recent activity, but fwiw, Plucene is "1.14" version numbers ahead ;-).

      --

      when small people start casting long shadows, it is time to go to bed
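
For readers unfamiliar with the inverted-index-plus-stop-words design described above, here is a minimal in-memory sketch in core Perl. It is not the poster's SQL implementation: the function names, the tiny stop-word list, and the scoring rule (sum of raw term counts, all terms required) are assumptions chosen for illustration. It has no stemming or fuzzy matching, and a real relevance score would normally also weight by term rarity and document length.

```perl
use strict;
use warnings;

# Hypothetical stop-word list; a real one would be much longer.
my %stop = map { $_ => 1 } qw(a an and the of to in is it for with);

# Split text into lowercase word tokens, dropping stop words.
sub tokenize {
    my ($text) = @_;
    return grep { !$stop{$_} } map { lc($_) } $text =~ /(\w+)/g;
}

# Build word => { doc_id => occurrence count } from { doc_id => text }.
sub build_index {
    my ($docs) = @_;
    my %index;
    for my $id (keys %$docs) {
        $index{$_}{$id}++ for tokenize($docs->{$id});
    }
    return \%index;
}

# Return the ids of documents containing every query term, ordered by
# the summed term counts (highest first), ties broken by id.
sub search {
    my ($index, $query) = @_;
    my @terms = tokenize($query);
    return () unless @terms;

    my (%score, %matched);
    for my $term (@terms) {
        my $postings = $index->{$term} or return ();   # unknown term: no hits
        for my $id (keys %$postings) {
            $score{$id} += $postings->{$id};
            $matched{$id}++;
        }
    }
    my @hits = grep { $matched{$_} == @terms } keys %score;
    return sort { $score{$b} <=> $score{$a} || $a cmp $b } @hits;
}

my $index = build_index({
    a => 'Perl search with Perl modules',
    b => 'The search for a canonical way',
    c => 'Perl and Java',
});
print join(', ', search($index, 'perl')), "\n";
```

The same structure maps onto SQL as a table of (word, doc_id, count) rows, with the ordering done by a `SUM(count)` grouped per document.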
Re: looking for a good Perl-way for implementing website search
by john_oshea (Priest) on May 10, 2006 at 12:01 UTC

    FWIW, we use Xapian/Omega which is very nice indeed (esp. handling unicode data).

Re: looking for a good Perl-way for implementing website search
by Anonymous Monk on Mar 13, 2007 at 15:42 UTC
    This is exactly what I LOVE about Perlmonks (the unoriginal ditto heads who hang out in the CB are what I hate). I am doing a little research on Plucene and I concur that the docs are so sparse that their purpose seems to be to drive people away. I found myself stuck, so I came here, did a quick Super Search, and voila!

    Thanks for linking to the Simon Cozens article. It was a great help to me.