Re^3: Improving a script to search for keywords in a file

To expand a little more...

If you don't know what "map" does, look it up with "perldoc -f map". (If you haven't learned about "perldoc" yet, run the command line "perldoc perldoc" to read about it, then get accustomed to using it every day. I still do.)

Basically, you would use your list of keywords to initialize a hash of arrays (%Counter) -- the keywords themselves are the hash keys, and each hash value will be an array of line numbers in your java file where the given keyword occurs.
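As a rough sketch of that structure (the keyword list and line numbers below are made up for illustration, and in the real script the keywords come from your keyword file), "map" can build the whole hash of arrays in one statement:

```perl
use strict;
use warnings;

# hypothetical keyword list, standing in for the contents of the keyword file
my @keywords = qw(class static void);

# map returns a flat (key, value) list; each value is a fresh empty array ref
my %Counter = map { $_ => [] } @keywords;

# record a couple of made-up line numbers for one keyword
push @{ $Counter{static} }, 3, 17;

for my $key ( sort keys %Counter ) {
    my $count = scalar @{ $Counter{$key} };
    print "$key: $count hit(s)\n";
}
```

Each keyword gets its own array, so pushing onto one key's array leaves the others empty.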

Here's one way to complete the app, with an example of how to output the results -- while I'm at it, I'll suggest using a single regex that concatenates all the possible keywords, to make the nested loop a little more efficient:

use strict;
use warnings;

my %Counter;
open( K, "input" ) or die "input (keyword file): $!";
while ( <K> ) {
    chomp;
    $Counter{$_} = []; # initialize to (a reference to) an empty array
}
close K;

# concatenate the keywords with "|" to form a regular expression:
my $key_regex = join( '|', keys %Counter );

open( J, "solution.java" ) or die "solution.java: $!";
while ( <J> ) {
    while ( /\b($key_regex)\b/g ) {
        push @{$Counter{$1}}, $.;
    }
}
close J;

# now print a list of keyword hits:

for my $key ( sort keys %Counter ) {
    next unless scalar @{$Counter{$key}};
    print "$key found on lines @{$Counter{$key}}\n";
}

Note the use of "\b" and parens around "$key_regex" -- this makes sure that we match whole keywords, surrounded by word boundaries, and captures whatever keyword was matched into $1. With the "g" modifier, each evaluation of the regex match in the inner while condition finds the next match on the current line, picking up where the previous one left off, so the loop body runs once for every match (zero or more per line) from the java file.
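To see the inner loop in isolation, here is a minimal sketch run against a single string instead of a file; the sample line and the two-keyword pattern are assumptions made up for the demo:

```perl
use strict;
use warnings;

# made-up sample line and keyword pattern, for illustration only
my $line      = 'public static void main(String[] args) { static int voided; }';
my $key_regex = 'static|void';

my @hits;
# in scalar context, each /g match resumes where the previous one ended
while ( $line =~ /\b($key_regex)\b/g ) {
    push @hits, $1;
}

print "matched: @hits\n";
```

This prints "matched: static void static". The word "voided" is skipped because there is no word boundary right after "void" in it.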

You can learn more about the Hash of Arrays (HoA) and other data structures by running "perldoc perldsc" (the data structures cookbook); "perldoc perlreftut" can also be helpful, and "perldoc perlre" explains \b, etc.

(update: the proposed solution assumes that the list of keywords is made up entirely of alphabetic (or alphanumeric) words; any non-alphanumeric characters in the keyword list are likely to muck things up, in particular: characters with special meanings in a regex, such as [.+@#$^*(){}\] -- there are ways around this, but we don't need to go into that yet, I think.)
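For whenever it does come up: one common way around it is the built-in "quotemeta" function, which backslash-escapes regex metacharacters before the keywords are joined. This is only a sketch; the keywords and the sample line are hypothetical, and it is not part of the script above:

```perl
use strict;
use warnings;

# hypothetical keywords containing regex metacharacters
my @keywords = ( 'i++', 'a.b', '@Override' );

# quotemeta backslash-escapes every non-word character in $_
my $key_regex = join( '|', map { quotemeta } @keywords );

my $line = 'x = i++; axb = a.b;';
my @hits;
# \b is left out here: it only behaves as expected next to word characters
while ( $line =~ /($key_regex)/g ) {
    push @hits, $1;
}

print "matched: @hits\n";
```

This prints "matched: i++ a.b". Because the dot is escaped, "axb" is not mistaken for "a.b". Dropping "\b" means partial-word matches become possible again, so keywords like these would need some other way of marking their edges.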
