
m/$keyword/
is wrong if the keyword is meant to match literally. You need to escape regex metacharacters. A simple way of doing this is
m/\Q$keyword/
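
To make the difference concrete, here is a minimal sketch. The keywords 'a.b' and 'foo(' and the sample text are made up for illustration and are not taken from the original question.

```perl
use strict;
use warnings;

my $keyword = 'a.b';    # hypothetical keyword containing a metacharacter
my $text    = 'axb';    # does NOT contain the literal string "a.b"

# Unescaped, '.' matches any character, so this is a false positive.
my $unescaped = ( $text =~ m/$keyword/ )   ? 1 : 0;
# With \Q the '.' must match a literal dot, so there is no match.
my $escaped   = ( $text =~ m/\Q$keyword/ ) ? 1 : 0;

print "unescaped: $unescaped, escaped: $escaped\n";

# A keyword with an unbalanced parenthesis does not even compile unescaped.
my $bad       = 'foo(';                      # hypothetical keyword
my $raw_re    = eval { qr/$bad/ };           # dies inside eval, leaves undef
my $quoted_re = eval { qr/\Q$bad/ };         # compiles fine

my $raw_ok    = defined $raw_re    ? 1 : 0;
my $quoted_ok = defined $quoted_re ? 1 : 0;
print "raw compiles: $raw_ok, quoted compiles: $quoted_ok\n";
```

So an unescaped keyword can either match text it should not, or abort the program with a regex compilation error.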

my $pattern = join('|', @keywords );
has the same problem. Use
my $pattern = join('|', map quotemeta, @keywords);
instead.
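
As a rough sketch of the quotemeta version in use (the keywords and sample lines are invented for the example, and wrapping the alternation in qr/(?:...)/ is my own choice, not a requirement):

```perl
use strict;
use warnings;

my @keywords = ( 'C++', 'a.b', 'perl' );    # hypothetical keywords

# Escape each keyword, then join them into a single alternation.
my $alternation = join '|', map { quotemeta($_) } @keywords;
my $pattern     = qr/(?:$alternation)/;     # compile once, reuse per line

my @lines = (
    'I write C++ daily',
    'axb is not a keyword',
    'plain perl here',
    'nothing to see',
);

# Keep only the lines that contain at least one keyword literally.
my @hits = grep { /$pattern/ } @lines;
print "$_\n" for @hits;
```

Only the first and third lines are printed; 'axb' is rejected because the dot in 'a.b' is now literal.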

If the list of words is long, you can speed things up a lot by using Regexp::List:
my $pattern = Regexp::List->new->list2re(@keywords);

All together, we get:

use Regexp::List ();

my @keywords = ("keyword1", "keyword2", "keyword3");

my $pattern = Regexp::List->new->list2re(@keywords);
#my $pattern = join('|', map quotemeta, @keywords);  # Alternative

while (<SEARCHFILE>) {
    if ($_ =~ $pattern) {    # or just: if (/$pattern/) {
        print $_;            # or just: print;
    }
}

Re^3: searching for keywords
by davido (Cardinal) on Jan 18, 2006 at 06:30 UTC

    Great suggestion on the Regexp::List module. I hadn't investigated it before. I'm impressed with how it optimizes the list to minimize costly alternation. Efficiency seems to have been one of the primary design philosophies.

    Does anyone know if there is a PPM3 build of it anywhere? I didn't find it on the ActiveState repositories. I would love to play with it.

    I toyed with another solution that turns the problem upside down by putting the keywords in a hash, pulling out individual words one by one from the file, and checking for the existence of a given word in the keyword hash. For large keyword lists it could prove more efficient than pure simple alternation, since hash lookups take O(1) time on average:

    use strict;
    use warnings;

    my %keywords;
    @keywords{ 'keyword1', 'keyword2', 'keyword3' } = ();

    while( <DATA> ){
        chomp;
        while( m/\b([\w'-]+)\b/g ) {
            print "'$_' contains keyword: $1\n" if exists $keywords{ $1 };
        }
    }

    __DATA__
    a line with keyword2 in it
    a line with keyword1 and keyword3.
    a line with no keywords.
    keyword1 can start a line too.
    and a line can end in keyword2

    Enjoy.


    Dave

      It's Pure Perl. Just unzip its lib/ into your site/lib/
      Nice code. I just want to add that if
      while( m/\b([\w'-]+)\b/g ) {
      is replaced by
      while( m/\b([\w'-]+)\b/gi ) {
      your program becomes case-insensitive.

        Not really. Your use of the /i switch is meaningless in this context. The point here is to pull words one by one from the text file, and see if there exists a hash element whose key matches that word. But hash keys themselves are case-sensitive. All that the regexp is doing is grabbing one "word" at a time. That word still has to be found as a key in the keyword hash. There is actually nothing in the regular expression I've used that would be affected by the /i switch in any way, other than to possibly slow down the regexp's execution speed.

        As a matter of fact, my solution is the only one posted thus far in this thread that wouldn't match case insensitively when the /i switch is added. Your post is a good observation if applied to the other answers provided in this thread.


        Dave
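
For anyone who does want the hash approach to ignore case, one possible way (a sketch only, not something posted in the thread) is to fold both the hash keys and each extracted word with lc before the lookup. The keywords and sample lines below are invented for the example.

```perl
use strict;
use warnings;

# Store every keyword in lower case (hypothetical mixed-case keyword list).
my %keywords;
$keywords{ lc $_ } = 1 for ( 'Keyword1', 'KEYWORD2' );

my @lines = (
    'a line with keyword2 in it',
    'KeyWord1 starts this line',
    'a line with no keywords',
);

my @found;
for my $line (@lines) {
    # Same word-extraction regexp as above; /i is still not needed here.
    while ( $line =~ m/\b([\w'-]+)\b/g ) {
        # Lower-case the candidate word before the hash lookup.
        push @found, $1 if exists $keywords{ lc $1 };
    }
}

print "found: @found\n";
```

The case folding happens at the hash lookup rather than in the regexp, which is why /i alone has no effect on this solution.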