
Can you help me find a set of relevant keys in a more efficient manner?

I want to count the total number of times a word stem appears in a hash. Here is a short example:

use strict;
use Lingua::Stem::Snowball;
my $idea  = 'books';
my %words = ( 'books'        => 5,
             'library'       => 6,
             'librarianship' => 5,
             'librarians'    => 3,
             'librarian'     => 3,
             'book'          => 3,
             'museums'       => 2
           );
my $stemmer   = Lingua::Stem::Snowball->new( lang => 'en' );
my $idea_stem = $stemmer->stem( $idea );
print "$idea ($idea_stem)\n";
my $total = 0;
foreach my $word ( keys %words ) {
 my $word_stem = $stemmer->stem( $word );
 print "\t$word ($word_stem)\n";
 if ( $idea_stem eq $word_stem ) { $total += $words{ $word } }
}
print "$total\n";

In the end, the value of $total equals 8. That is, more or less, what I expect, but how can I make the foreach loop more efficient? In reality, my application fills %words with as many as 150,000 keys. Moreover, $idea is really just a single element in an array of about 100 words. Doing the math, the if statement in my foreach loop will get executed as many as 15,000,000 times (150,000 keys times 100 words), and each pass also re-stems every key. To make matters even worse, I plan to run the whole program about 10,000 times. That is a whole lot of processing just to count words!

Is there some way I could short-circuit the foreach loop? I saw Lingua::Stem::Snowball's stem_in_place method, but to use it I must pass it an array reference, which disassociates my keys from their values.
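One possible way to avoid rescanning %words for every idea is to stem each key once and accumulate the counts per stem, so that every later lookup is a single hash access. The following is only a minimal sketch of that idea, not a tested solution for the real data: build_stem_totals and $toy_stem are hypothetical names, and $toy_stem is a crude stand-in (it strips one trailing "s") used so the example runs without any CPAN module. With the real stemmer, the code reference would instead wrap the stem call from the example above.

```perl
use strict;
use warnings;

# Build a stem => total-count index in ONE pass over the word hash.
# $stem_of is a code reference that maps a word to its stem.
sub build_stem_totals {
    my ( $words, $stem_of ) = @_;
    my %totals;
    while ( my ( $word, $count ) = each %$words ) {
        $totals{ $stem_of->($word) } += $count;
    }
    return \%totals;
}

# Stand-in stemmer (an assumption for illustration only): lowercases the
# word and strips one trailing "s". With Lingua::Stem::Snowball, pass
#   sub { scalar $stemmer->stem( $_[0] ) }
# instead.
my $toy_stem = sub { ( my $w = lc shift ) =~ s/s\z//; return $w };

my %words = (
    books      => 5, library   => 6, librarianship => 5,
    librarians => 3, librarian => 3, book          => 3,
    museums    => 2,
);

my $totals = build_stem_totals( \%words, $toy_stem );

# Each idea now costs one stem call and one hash lookup.
foreach my $idea (qw( books museums cats )) {
    my $total = $totals->{ $toy_stem->($idea) } // 0;
    print "$idea\t$total\n";
}
```

With this layout the 150,000 keys are stemmed once per run rather than once per idea, and the roughly 100 ideas each need only a lookup. An idea whose stem never occurs simply yields 0.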

Second, is there a way I can make the stemming more aggressive? For example, I was hoping the stem of library would equal the stems of librarianship, librarians, and librarian, but alas, it doesn't.

Any suggestions?

In reply to finding a set of relevant keys by ericleasemorgan
