in reply to Re^2: searching for keywords
in thread searching for keywords

Great suggestion on the Regexp::List module. I hadn't investigated it before. I'm impressed with how it optimizes the list to minimize costly alternation. Efficiency seems to have been one of the primary design philosophies.

Does anyone know if there is a PPM3 build of it anywhere? I didn't find it on the ActiveState repositories. I would love to play with it.

I toyed with another solution that turns the problem upside down: put the keywords in a hash, pull the words out of the file one by one, and check whether each word exists in the keyword hash. For large keyword lists it could prove more efficient than plain alternation, since hash lookups take O(1) time on average:

use strict;
use warnings;

my %keywords;
@keywords{ 'keyword1', 'keyword2', 'keyword3' } = ();

while( <DATA> ){
    chomp;
    while( m/\b([\w'-]+)\b/g ) {
        print "'$_' contains keyword: $1\n"
            if exists $keywords{ $1 };
    }
}

__DATA__
a line with keyword2 in it
a line with keyword1 and keyword3.
a line with no keywords.
keyword1 can start a line too.
and a line can end in keyword2

Enjoy.


Dave

Re^4: searching for keywords
by ikegami (Patriarch) on Jan 18, 2006 at 06:53 UTC
    It's Pure Perl. Just unzip its lib/ into your site/lib/
Re^4: searching for keywords
by Sioln (Sexton) on Jan 18, 2006 at 07:19 UTC
    Nice code. I just want to add that if
    while( m/\b([\w'-]+)\b/g ) {
    is replaced by
    while( m/\b([\w'-]+)\b/gi ) {
    your program becomes case-insensitive.

      Not really. Your use of the /i switch is meaningless in this context. The point here is to pull words one by one from the text file and see whether a hash element exists whose key matches that word. But hash keys themselves are case-sensitive. All the regexp does is grab one "word" at a time. That word still has to be found as a key in the keyword hash. There is actually nothing in the regular expression I've used that would be affected by the /i switch in any way, other than possibly slowing down the regexp's execution.

      As a matter of fact, my solution is the only one posted thus far in this thread that wouldn't match case insensitively when the /i switch is added. Your post is a good observation if applied to the other answers provided in this thread.
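If case-insensitive matching is what you want from the hash approach, one way to get it (a minimal sketch, not the only option) is to fold case on both sides of the lookup: lowercase the keywords when filling the hash, and lowercase each word before testing it. The helper name find_keywords and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the words in $line that appear in the
# keyword hash, comparing case-insensitively by lowercasing each word.
sub find_keywords {
    my ( $keywords, $line ) = @_;
    my @found;
    while ( $line =~ m/\b([\w'-]+)\b/g ) {
        push @found, $1 if exists $keywords->{ lc $1 };
    }
    return @found;
}

# Store the keys lowercased so that lc() on the word side lines up.
my %keywords;
@keywords{ map { lc $_ } qw( Keyword1 keyword2 KEYWORD3 ) } = ();

for my $line ( 'A line with KEYWORD2 in it', 'a line with no keywords.' ) {
    print "'$line' contains keyword: $_\n"
        for find_keywords( \%keywords, $line );
}
```

The regexp is unchanged; only the hash keys and the lookup are normalized, which is why the /i switch plays no part here.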


      Dave