in reply to HTML::Index module -- what's the story?

You're right that there are few small options for a search engine, but that's because user expectations for the vast majority of applications are difficult to meet with a small-scale engine. Are your users really going to stand for non-stemmed searching? You've done this before, so of course I'll take your word for it, but that definitely puts you in the minority.

All the search engine libraries use Lingua::Stem or Lingua::Stem::Snowball, because it would make zero sense to reinvent that wheel. They only come in one package -- Lingua::Stem installs Lingua::Stem::Snowball -- which is mildly unfortunate because Snowball is XS and you need a C compiler. However, I can testify that it's very difficult to write a search engine which scales well to extremely large document collections in pure Perl.

--
Marvin Humphrey
Rectangular Research ― http://www.rectangular.com

Re^2: HTML::Index module -- what's the story?
by snowhare (Friar) on Nov 22, 2005 at 14:18 UTC

    As the author of Lingua::Stem I have to correct this: Lingua::Stem is a pure Perl module collection. That is in fact probably the single largest practical difference between it and Lingua::Stem::Snowball (which is entirely XS-based). While Lingua::Stem uses Lingua::Stem::Snowball::Da, Lingua::Stem::Snowball::No and Lingua::Stem::Snowball::Se as 'plugin' components, those modules are standalone pure Perl items that are completely independent of the main Lingua::Stem::Snowball distribution, even though they share Lingua::Stem::Snowball's namespace.

    As to the complaint that Lingua::Stem installs unwanted European stemmers, I think that is a matter of perspective: Some Europeans might complain that it installs an unwanted English stemmer ;).

    Distributions like Lingua::Stem and Lingua::Stem::Snowball have multiple user bases by design. They are intended to create standards for implementing this type of module, so that there are not dozens of different APIs and namespaces for modules that all basically do the same thing for slightly different audiences. Apart from using a small amount of extra disk space, the presence of features you don't need for your particular use isn't really an issue, as long as they don't interfere with that use.