in reply to Constructive criticism of a dictionary / text comparison script

This is more a suggestion on functionality than a critique of the code. One thing that I ran into with my Boggle script is that the Unix dict file doesn't have variants of words. For example, it has huge but not hugely, fish but not fishes or fishing, etc.

Ideally, you would have some kind of functionality to address this. One possibility is to stem words before you check them against the dictionary. I know that Lingua::Stem implements one popular algorithm to do this. I didn't look into it closely enough to see if it would do the trick for me.
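To make the idea concrete, here is a minimal sketch of a dictionary check that falls back to a stemmed form. It deliberately does not use Lingua::Stem; the suffix list, the tiny `%dict` hash, and the `lookup` helper are all hypothetical stand-ins, and naive suffix stripping like this is far cruder than a real stemming algorithm.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the Unix dict file: base forms only.
my %dict = map { $_ => 1 } qw(huge fish walk);

# Illustrative suffix list, tried in this order; not a real stemming algorithm.
my @suffixes = qw(ing ly es ed s);

# Return the dictionary entry that matches $word, or nothing if none does.
sub lookup {
    my ($word) = @_;
    $word = lc $word;
    return $word if $dict{$word};
    for my $suffix (@suffixes) {
        next unless $word =~ /^(.+)\Q$suffix\E$/;
        my $base = $1;
        # Try the bare base, then the base with a restored final "e".
        for my $candidate ($base, $base . 'e') {
            return $candidate if $dict{$candidate};
        }
    }
    return;
}

for my $word (qw(huge hugely fishes fishing xyzzy)) {
    my $found = lookup($word);
    printf "%-8s %s\n", $word, defined $found ? "-> $found" : 'not found';
}
```

With this toy dictionary, hugely resolves to huge and both fishes and fishing resolve to fish, while xyzzy is reported as not found. A proper stemmer would replace the suffix loop, but the shape of the check (exact match first, stemmed match second) would stay the same.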

</ajdelore>

Re: Re: Constructive criticism of a dictionary / text comparison script
by allolex (Curate) on Aug 30, 2003 at 06:35 UTC

    I really like your idea, and it would work very well if I were dealing with texts in languages that all had a stemming module. I am seriously considering writing one for French. Currently, I am working with Italian, which does have Lingua::Stem::It, but my dictionary already contains inflected word forms as well. The huge advantage of working with a stemmer is that it can also stem novel constructions (like stemage), which the dictionary does not account for. It would be a very interesting modification to create a dictionary of stem forms, but checking its accuracy would also be a lot more work.

    What would really be cool is a stemming module that defined all affixes via a hash of some kind, so that tense, mode/mood, plural, person, etc. could be looked up like

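    use utf8;  # the suffix patterns contain accented characters

    # Each category maps to an array reference; a bare qw() list would be
    # flattened into the surrounding hash and scramble the keys and values.
    my %hash_of_verb_suffixes = (
        future      => [qw([ei]rò [ei]rai [ei]rà [ei]remo [ei]rete [ei]ranno)],
        conditional => [qw([ei]rei [ei]resti [ei]rebbe [ei]remmo [ei]reste [ei]rebbero)],
    );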

    and so on.
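As a rough sketch of how such a lookup might work, the suffix patterns can be compiled into one anchored regex per category and tested against a word. The names `%verb_suffixes`, `%suffix_re`, and `categories_for` are hypothetical, and treating the entries as regex fragments is an assumption; the post only outlines the hash itself.

```perl
use strict;
use warnings;
use utf8;                              # source contains accented characters
binmode STDOUT, ':encoding(UTF-8)';

# Category name => array reference of suffix patterns (regex fragments).
my %verb_suffixes = (
    future      => [qw([ei]rò [ei]rai [ei]rà [ei]remo [ei]rete [ei]ranno)],
    conditional => [qw([ei]rei [ei]resti [ei]rebbe [ei]remmo [ei]reste [ei]rebbero)],
);

# Compile one end-anchored alternation per category.
my %suffix_re;
for my $category (keys %verb_suffixes) {
    my $alternation = join '|', @{ $verb_suffixes{$category} };
    $suffix_re{$category} = qr/(?:$alternation)$/;
}

# Return the sorted list of categories whose suffixes match $word.
sub categories_for {
    my ($word) = @_;
    my @hits   = grep { $word =~ $suffix_re{$_} } keys %suffix_re;
    my @sorted = sort @hits;
    return @sorted;
}

for my $word (qw(parlerò parleremo parlerebbe parlo)) {
    my @categories = categories_for($word);
    printf "%-12s %s\n", $word, @categories ? "@categories" : '(no match)';
}
```

A pure suffix match like this will also fire on non-verbs that happen to end the same way, so in practice it would need to be combined with a dictionary or stem check.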

    Oh, wait. That's a POS tagger;)

    In any case, I can see we think along similar lines. Thanks!

    --
    Allolex