PerlMonks  

Natural Language Index Stemming

by rob_au (Abbot)
on Jun 18, 2002 at 01:13 UTC

rob_au has asked for the wisdom of the Perl Monks concerning the following question:

I am curious about other monks' experience with natural language stemming for site indexes. I ask because I am rewriting a site search engine (to improve maintainability and to fit the corporate application environment) and have come across a number of discussions of natural language stemming in this type of application.

For those unfamiliar with the concept, stemming is the process of reducing a word to its stem or root form. This allows similar words such as computer and computing to be conflated, or reduced to a single root (for example, comput), thereby reducing index dictionary size and, in theory, storage requirements and processing time. A further discussion of this concept can be found here.
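
To make the dictionary-size argument concrete, here is a minimal sketch that builds one inverted index keyed by raw words and one keyed by stems. The `toy_stem` routine and its suffix list are assumptions invented for this example; it is a crude suffix stripper, not the Porter algorithm and not what Lingua::Stem does.

```perl
use strict;
use warnings;

# Toy suffix stripper for illustration only; this is NOT the Porter algorithm.
# The suffix list is an assumption chosen to fit the example words.
sub toy_stem {
    my ($word) = @_;
    $word = lc $word;
    for my $suffix (qw(ing er ers ed s)) {
        # Strip the first suffix that matches, keeping at least 3 letters.
        return $1 if $word =~ /^(.{3,})\Q$suffix\E\z/;
    }
    return $word;
}

# Three tiny "documents", keyed by document id.
my %docs = (
    1 => 'computer indexing',
    2 => 'computing indexed',
    3 => 'computers index',
);

# Inverted indexes: term => { document id => 1 }.
my (%raw_index, %stem_index);
for my $id (sort keys %docs) {
    for my $word (split ' ', $docs{$id}) {
        $raw_index{$word}{$id} = 1;
        $stem_index{ toy_stem($word) }{$id} = 1;
    }
}

printf "raw keys: %d, stemmed keys: %d\n",
    scalar(keys %raw_index), scalar(keys %stem_index);
# prints "raw keys: 6, stemmed keys: 2"
```

Six distinct words collapse to the two keys comput and index, so a search for any one form finds all three documents.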

While this type of processing reduces the number of index dictionary keys, I am concerned about the likelihood of stemming errors, whereby dissimilar words are stemmed to the same root, particularly given that indexing speed and space requirements should not be an issue in the application environment. See here for a discussion of over- and under-stemming errors.

And so I ask a barrage of questions:

  • What are the experiences of fellow monks with natural language stemming?
  • Have other monks found better results, as measured by minimal stemming errors, via one stemming algorithm (for example, Paice-Husk, Porter, etc.) over another?
  • And in particular, what are other monks' experiences with the Porter stemming algorithm as implemented in Lingua::Stem?
 

My thanks in advance

 

Replies are listed 'Best First'.
Re: Natural Language Index Stemming
by cjf (Parson) on Jun 18, 2002 at 04:58 UTC

    As for Lingua::Stem, I just tried out a few examples from Stemming Performance that you linked to:

        use strict;
        use Lingua::Stem;

        my $stemmer = Lingua::Stem->new();
        my @words   = qw/maintained maintenance environment experience/;
        my $stems   = $stemmer->stem(@words);
        print "$_ " for (@$stems);

    The output was:

    maintain mainten environ experi

    So it appears to have failed to merge maintain with maintenance(?), but correctly dealt with the environment/experience difference described on that page. This is the first time I've looked into the subject, so I could be a fair bit off the mark :).
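
    The same comparison can be made mechanical. The sketch below hard-codes the four stems exactly as reported in the output above, so it runs without Lingua::Stem installed. The concept labels in `%concept_of` are hand-assigned assumptions for this illustration: words sharing a label are expected to share a stem.

```perl
use strict;
use warnings;

# Stems exactly as reported in the output above.
my %stem_of = (
    maintained  => 'maintain',
    maintenance => 'mainten',
    environment => 'environ',
    experience  => 'experi',
);

# Hand-assigned concept labels (an assumption made for this illustration).
my %concept_of = (
    maintained  => 'maintain',
    maintenance => 'maintain',
    environment => 'environment',
    experience  => 'experience',
);

# Compare every pair of words once and classify any disagreement.
sub count_stemming_errors {
    my ($stems, $concepts) = @_;
    my @words = sort keys %$stems;
    my ($under, $over) = (0, 0);
    for my $i (0 .. $#words - 1) {
        for my $j ($i + 1 .. $#words) {
            my ($w1, $w2) = @words[$i, $j];
            my $same_concept = $concepts->{$w1} eq $concepts->{$w2};
            my $same_stem    = $stems->{$w1}    eq $stems->{$w2};
            $under++ if  $same_concept && !$same_stem;   # should have merged
            $over++  if !$same_concept &&  $same_stem;   # should not have merged
        }
    }
    return ($under, $over);
}

my ($under, $over) = count_stemming_errors(\%stem_of, \%concept_of);
print "under-stemmed pairs: $under, over-stemmed pairs: $over\n";
# prints "under-stemmed pairs: 1, over-stemmed pairs: 0"
```

    With a larger hand-labelled word list, the same two counters give a rough way to compare one stemmer against another.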

    As for other (sort of) related modules, I've found TheDamian's Lingua::EN::Inflect to be useful (and fun) to use on occasion. I'm not sure how much that applies to your question though.

    ++ for an interesting thread, I look forward to hearing what your conclusions are.

    Edited 18 June 2002 (footpad): Fixed broken </code> tag.

Re: Natural Language Index Stemming
by samtregar (Abbot) on Jun 18, 2002 at 01:56 UTC
    I built a search engine in Perl that used Glimpse as the backend searcher. Glimpse supports several varieties of stemming, which were available as options in my system. It seemed to work as advertised.

    -sam

Re: Natural Language Index Stemming
by perrin (Chancellor) on Jun 18, 2002 at 01:25 UTC
    The Porter algorithm worked well enough for us when we built the search engine for etoys.com. I haven't tried any others. The implementation we used was actually in C, though.

Re: Natural Language Index Stemming
by toma (Vicar) on Jun 18, 2002 at 06:23 UTC
    I used Lingua::Stem when I made concordances of some Shakespeare and Melville texts that I downloaded from Project Gutenberg. I found that the stemming was quite conservative for my purposes, erring on the side of avoiding collisions.

    My more challenging problem was the proper choice of stoplist words, which would not be indexed at all.
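
    A minimal sketch of stoplist filtering follows, with a deliberately tiny stoplist that is an assumption for this example; a real list would be tuned to the corpus. The sample line shows why the choice is hard: a famous phrase can be made up almost entirely of words that a stoplist would drop.

```perl
use strict;
use warnings;

# A tiny illustrative stoplist; a real one would be tuned to the corpus.
my %stop = map { $_ => 1 } qw(a an and of the to in is it that);

# Split text into lowercase words and drop anything on the stoplist.
sub index_terms {
    my ($text) = @_;
    return grep { !$stop{$_} } map { lc($_) } $text =~ /([A-Za-z']+)/g;
}

my @terms = index_terms('To be, or not to be, that is the question');
print "@terms\n";
# prints "be or not be question"
```

    Adding be, or and not to the list, as many general-purpose stoplists do, would leave only question to index.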

    I will someday integrate stemming into my Style and Spelling Checker, I hope.

    It should work perfectly the first time! - toma

Re: Natural Language Index Stemming
by simon.proctor (Vicar) on Jun 18, 2002 at 07:44 UTC
    I used Paice Husk stemming for my search engine and used MLDBM and Storable for creating the index. I also used a second index to cache the HTML meta data.

    I quite liked Paice Husk as it translated to Perl very easily. I just had to keep the rules in an array and reverse all fragments of my search terms.
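
    A rough sketch of that table-driven style is shown below: the rules live in an array, each ending is stored reversed, and the word is reversed before matching. The four rules and the minimum stem length are made up for illustration. The published Paice/Husk rule table is much larger and has extra conditions (such as "intact word only" flags) that this sketch omits.

```perl
use strict;
use warnings;

# Each rule: [ reversed ending, letters to remove, string to append, continue? ]
# These rules are illustrative assumptions, not the published rule set.
my @rules = (
    [ 'gni', 3, '',  1 ],   # -ing -> (nothing), then keep going
    [ 'de',  2, '',  1 ],   # -ed  -> (nothing), then keep going
    [ 'sei', 3, 'y', 0 ],   # -ies -> -y, then stop
    [ 's',   1, '',  1 ],   # -s   -> (nothing), then keep going
);

sub rule_stem {
    my ($word) = @_;
    my $stem     = lc $word;
    my $continue = 1;
    while ($continue) {
        $continue = 0;
        my $reversed = reverse $stem;   # endings now sit at the front
        for my $rule (@rules) {
            my ($ending, $remove, $append, $again) = @$rule;
            next unless index($reversed, $ending) == 0;
            next if length($stem) - $remove < 3;   # keep at least 3 letters
            $stem = substr($stem, 0, length($stem) - $remove) . $append;
            $continue = $again;
            last;   # restart rule matching on the new stem
        }
    }
    return $stem;
}

print "$_ -> ", rule_stem($_), "\n" for qw(indexing indexed queries engines);
```

    Keeping the rules as data means the stemmer can be retuned by editing the table rather than the code.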

    If you want an alternative to Lingua::Stem then I seriously recommend it. You can find the paper here. They also give an (old) Perl example, which should help provide a basis for your app if you choose to try it.

Re: Natural Language Index Stemming
by PetaMem (Priest) on Jun 18, 2002 at 11:23 UTC
    Aaah my lovely favourite subfield of interest...

    First off, you can differentiate between knowledge-based stemming algorithms and probabilistic stemming. And of course there are a bunch of heuristic mixtures of these two approaches spread all over the literature and the web. If you want something "not so good, but good enough and not expensive", you could use the next generation of the old stemmer: see Snowball. Snowball is quite OK, especially because there are descriptions for more languages. However, you will never reach 100% accuracy with this approach, as only a dictionary of a given language together with morphological knowledge will give you the best (but still ambiguous) results.
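
    One common way to mix the two approaches is to consult a dictionary first and fall back to a rule-based stemmer for unknown words. The sketch below assumes a hypothetical hand-built `%lexicon` and a crude `fallback_stem` standing in for a real algorithmic stemmer; neither comes from Snowball or any other library.

```perl
use strict;
use warnings;

# Hypothetical hand-built lexicon mapping known word forms to a root.
my %lexicon = (
    mice    => 'mouse',
    ran     => 'run',
    species => 'species',   # protects the word from the "-s" fallback rule
);

# Fallback: a crude suffix stripper standing in for an algorithmic stemmer.
sub fallback_stem {
    my ($word) = @_;
    $word =~ s/(?:ing|ed|s)\z// if length($word) > 4;
    return $word;
}

# Dictionary lookup first, algorithmic fallback for everything else.
sub lookup_stem {
    my ($word) = @_;
    $word = lc $word;
    return exists $lexicon{$word} ? $lexicon{$word} : fallback_stem($word);
}

print "$_ -> ", lookup_stem($_), "\n" for qw(mice species walking indexes);
```

    The last word comes out as indexe, which shows the limit of the fallback: anything missing from the lexicon is only as good as the suffix rules.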

    But this requires heavy-duty hardware on which heavy-duty software can run...

    Bye
     PetaMem

Re: Natural Language Index Stemming
by quinkan (Monk) on Jun 19, 2002 at 06:49 UTC
