ericleasemorgan has asked for the wisdom of the Perl Monks concerning the following question:

Can you help me find a set of relevant keys in a more efficient manner?

I want to count the total number of times a word stem appears in a hash. Here is a short example:

use strict;
use Lingua::Stem::Snowball;

my $idea  = 'books';
my %words = (
    'books'         => 5,
    'library'       => 6,
    'librarianship' => 5,
    'librarians'    => 3,
    'librarian'     => 3,
    'book'          => 3,
    'museums'       => 2
);

my $stemmer   = Lingua::Stem::Snowball->new( lang => 'en' );
my $idea_stem = $stemmer->stem( $idea );
print "$idea ($idea_stem)\n";

my $total = 0;
foreach my $word ( keys %words ) {
    my $word_stem = $stemmer->stem( $word );
    print "\t$word ($word_stem)\n";
    if ( $idea_stem eq $word_stem ) { $total += $words{ $word } }
}
print "$total\n";

In the end, the value of $total equals 8. That is, more or less, what I expect, but how can I make the foreach loop more efficient? In reality, my application fills %words with as many as 150,000 keys. Moreover, $idea is really just a single element in an array of about 100 words. Doing the math, the if statement in my foreach loop will get executed as many as 15,000,000 times. To make matters even worse, I plan to run the whole program about 10,000 times. Do the math. That is a whole lot of processing just to count words!

Is there some way I could short-circuit the foreach loop? I saw Lingua::Stem::Snowball's stem_in_place method, but to use it I must pass it an array, disassociating my keys from their values.

Second, is there a way I can make the stemming more aggressive? For example, I was hoping the stem of library would equal the stems of librarianship, librarians, and librarian, but alas, it doesn't.

Any suggestions?

Replies are listed 'Best First'.
Re: finding a set of relevant keys
by Limbic~Region (Chancellor) on Oct 13, 2009 at 15:53 UTC
    ericleasemorgan,
    Your data structure doesn't allow short-circuiting. You could do a one-time pass of %words and create a parallel data structure that would make subsequent searches a hash lookup. Something like this:
    my %stems;
    for my $word (keys %words) {
        my $stem = $stemmer->stem($word);
        $stems{$stem} += $words{$word};
    }

    for my $idea (@ideas) {
        my $stem = $stemmer->stem($idea);
        my $val  = $stems{$stem} || 0;
        print "$idea\t$val\n";
    }

    Of course, if %words is constantly changing then so must %stems. There are ways to handle this (tied hashes, for instance) or just a better data structure (a tree versus a hash). As far as a better stemming library goes, the library you are using does support exceptions. When you say you are going to run the program 10K times, it makes me wonder whether each run is only going to fetch the value of a single word. That would obviously be inefficient. It would be better to serialize (freeze/thaw) your data structure so you don't pay the runtime penalty of rebuilding the hash every time.
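    For the freeze/thaw idea, here is a minimal sketch using the core Storable module; the stems.db file name is just an illustration, and %stems is the structure built above:

    use Storable qw( nstore retrieve );

    # Build %stems once (as above), then save it to disk...
    nstore( \%stems, 'stems.db' );

    # ...and in later runs load it back instead of re-stemming 150,000 keys.
    my $stems = retrieve( 'stems.db' );
    my $val   = $stems->{ $stemmer->stem( $idea ) } || 0;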

    Cheers - L~R

      Creating a "parallel" data structure in the manner you describe is what others have suggested as well. Thank you!

        ericleasemorgan,
        To be honest, changing %words to a more appropriate data structure would probably be best. This way you could walk the tree. Unfortunately, without knowing how %words is used by the rest of the program or how it is maintained, that is a blind recommendation.
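        One way to picture such a structure is a two-level hash keyed by stem and then by original word. This is only a sketch of the "walk the tree" idea, not a drop-in replacement; it reuses $stemmer and %words from the original post:

        # First level: the stem; second level: the original words and their
        # counts, so you can still walk from a stem down to the words behind it.
        my %by_stem;
        for my $word ( keys %words ) {
            my $stem = $stemmer->stem( $word );
            $by_stem{ $stem }{ $word } = $words{ $word };
        }

        # The total for an idea is then a sum over one small sub-hash,
        # not a scan of all 150,000 keys.
        use List::Util qw( sum0 );
        my $total = sum0( values %{ $by_stem{ $stemmer->stem( $idea ) } || {} } );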

        Cheers - L~R

Re: finding a set of relevant keys
by spx2 (Deacon) on Oct 14, 2009 at 11:22 UTC

    First, for making it more efficient: stem_in_place is a very good fit for this. However, because it will modify all your words, you should have a way to keep track of the initial words. You can do something like:

    # Assumes @words holds the original word list (for instance, keys %words).
    my @words_stemmed = @words;
    $stemmer->stem_in_place( \@words_stemmed );

    # @all will contain hashrefs like { original => 'books', stemmed => 'book' }
    my @all = map {
        +{ original => $words[$_], stemmed => $words_stemmed[$_] }
    } 0 .. $#words;

    I also think you should use ->stem_in_place with some kind of global cache (but that would mean adding this to the .xs if it's not already there).
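    Short of patching the XS, a cache can also live on the Perl side. A sketch, assuming a plain hash keyed by the raw word; cached_stem is a made-up helper name:

    # A pure-Perl cache so each distinct word is stemmed only once.
    my %stem_cache;

    sub cached_stem {
        my ( $stemmer, $word ) = @_;
        $stem_cache{ $word } //= $stemmer->stem( $word );   # //= needs Perl 5.10+
        return $stem_cache{ $word };
    }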

    For performance-related issues you can use Gearman or POE to distribute the work across many machines. It's not a surprise that just "counting words" is time-consuming; a lot of things in NLP are.
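    Gearman and POE both need more setup than fits in a post; as a single-machine stand-in for the same divide-the-work idea, here is a sketch using Parallel::ForkManager (the worker count is arbitrary, and @ideas, %stems, and $stemmer are assumed from the earlier snippets):

    use Parallel::ForkManager;

    my $workers = 4;                                   # arbitrary
    my $pm      = Parallel::ForkManager->new( $workers );

    # Deal the ~100 ideas out into one chunk per worker.
    my @chunks;
    push @{ $chunks[ $_ % $workers ] }, $ideas[ $_ ] for 0 .. $#ideas;

    for my $chunk ( @chunks ) {
        $pm->start and next;                           # parent: spawn child, move on
        for my $idea ( @$chunk ) {
            my $stem = $stemmer->stem( $idea );
            print "$idea\t", ( $stems{ $stem } || 0 ), "\n";
        }
        $pm->finish;                                   # child exits here
    }
    $pm->wait_all_children;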

    PetaMem here on PerlMonks also does NLP, so maybe he can give some details on how this can be improved.

    UPDATE: took a closer look at the OP and read more carefully.