in reply to Re: Improving a script to search for keywords in a file
in thread Improving a script to search for keywords in a file

Guilty as charged: Perl is not my first programming language, and I am learning as I go along.
I am a bit confused about what the code below is doing. Where does <INPUT> come from?
What do the "..." represent?
And how do you display the results?
Thanks very much.
    # Initialize the counter
    my %Counter = map { $_, [] } @keywords;

    while( my $line = <INPUT> )    # one line at a time
    {
        ...
        if( $line =~ /$keyword/ ) {
            # $. is the current input line number
            push @{ $Counter{ $keyword } }, $.;
        ...
    }

Re^3: Improving a script to search for keywords in a file
by GrandFather (Saint) on Feb 20, 2006 at 01:09 UTC

    No need to feel guilty! You have come to the right place to learn though.

    <INPUT> reads a line from the INPUT file handle. Generally you use open to create a file handle:

    open INPUT, '<', 'theFileName' or die "Cannot open theFileName: $!";

    The first ... is just a placeholder for lines of code that you might insert at that point. In particular, you are likely to chomp $line to remove the line-end sequence from $line.
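
    As a minimal sketch of how open, <INPUT>, chomp and $. fit together: the sample text in $sample is made up for illustration, and an in-memory file handle stands in for a real file so that the snippet runs on its own.

```perl
use strict;
use warnings;

# Hypothetical sample data; a real script would open a file by name instead.
my $sample = "first line\nsecond line\nthird line\n";

# Opening a reference to a scalar gives an in-memory file handle.
open INPUT, '<', \$sample or die "Cannot open in-memory file: $!";

my @seen;
while ( my $line = <INPUT> ) {    # one line at a time
    chomp $line;                  # remove the trailing line end
    push @seen, "$.: $line";      # $. is the current input line number
}
close INPUT;

print "$_\n" for @seen;
```

    Each entry in @seen pairs a line number with the chomped text of that line.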

    The second ellipsis is a placeholder for any lines of code that you might use to do further processing of the current line. Neither affects the important bit, which is the hash access push @{ $Counter{ $keyword } }, $.;. That line pushes the current input line number onto the end of the array held by the counter hash (%Counter) under the key $keyword. You can later list all the lines that contained each keyword with something like:

    print "$_ found in lines: @{$Counter{$_}}\n" for sort keys %Counter;

    Note that $_ is Perl's default variable; here the for modifier aliases it to each key of %Counter (the keywords) in turn. @{$Counter{$_}} dereferences the array held by the hash entry for that keyword, so inside the double-quoted string its line numbers are printed with a space between them.
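
    To see what that print statement produces, here is a small sketch with a hand-filled hash. The keywords and line numbers are invented, and the output lines are collected in @report (a name used only for this example) before being printed.

```perl
use strict;
use warnings;

# Hypothetical results: keyword => reference to an array of line numbers.
my %Counter = (
    'while'  => [ 3, 17 ],
    'class'  => [ 1 ],
    'return' => [ 9, 12, 20 ],
);

my @report;
for ( sort keys %Counter ) {
    # @{ ... } dereferences the array; inside double quotes its elements
    # are joined with a single space (the default value of $").
    push @report, "$_ found in lines: @{ $Counter{$_} }";
}

print "$_\n" for @report;
```

    The keywords come out in sorted order because of sort keys; without the sort, hash keys are returned in no particular order.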

    It's not expected that your code will be quite this succinct on day 1. :) However, if you stick around here for a while and read a few of the other questions and answers, it'll start making sense pretty quickly. Good luck, and enjoy Perl.


    DWIM is Perl's answer to Gödel
Re^3: Improving a script to search for keywords in a file
by graff (Chancellor) on Feb 20, 2006 at 01:42 UTC
    To expand a little more...

    If you don't know what "map" does, look it up with "perldoc -f map". (If you haven't learned about "perldoc" yet, run the command line "perldoc perldoc" to read about it, then get accustomed to using it every day. I still do.)

    Basically, you would use your list of keywords to initialize a hash of arrays (%Counter) -- the keywords themselves are the hash keys, and each hash value will be an array of line numbers in your Java file where the given keyword occurs.
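
    A rough sketch of that initialization, assuming a made-up @keywords list: map pairs every keyword with its own new, empty array reference, and push then appends line numbers to whichever array belongs to a keyword.

```perl
use strict;
use warnings;

# Hypothetical keyword list, for illustration only.
my @keywords = qw(class while return);

# map returns a flat key/value list: each keyword paired with a
# reference to a new, empty array.
my %Counter = map { $_ => [] } @keywords;

# Record a couple of made-up hits for one keyword.
push @{ $Counter{'while'} }, 3;
push @{ $Counter{'while'} }, 17;

for my $key ( sort keys %Counter ) {
    my $hits = scalar @{ $Counter{$key} };
    print "$key: $hits hit(s)\n";
}
```

    Because [] creates a fresh array each time the block runs, the keywords do not share one list of line numbers.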

    Here's one way to complete the app, with an example of how to output the results -- while I'm at it, I'll suggest using a single regex that concatenates all the possible keywords, to make the nested loop a little more efficient:

    use strict;
    use warnings;

    my %Counter;
    open( K, "input" ) or die "input (keyword file): $!";
    while ( <K> ) {
        chomp;
        $Counter{$_} = [];    # initialize to (a reference to) an empty array
    }
    close K;

    # concatenate the keywords with "|" to form a regular expression:
    my $key_regex = join( '|', keys %Counter );

    open( J, "solution.java" ) or die "solution.java: $!";
    while ( <J> ) {
        while ( /\b($key_regex)\b/g ) {
            push @{$Counter{$1}}, $.;    # $. is the current line number in J
        }
    }
    close J;

    # now print a list of keyword hits:
    for my $key ( sort keys %Counter ) {
        next unless scalar @{$Counter{$key}};    # skip keywords with no hits
        print "$key found on lines @{$Counter{$key}}\n";
    }

    Note the use of "\b" and parens around "$key_regex": the "\b" anchors make sure that we match whole keywords, surrounded by word boundaries, and the parens capture whatever keyword was matched into $1. Because the regex match is the condition of the inner while loop, it runs in scalar context, where the "g" modifier makes each evaluation pick up where the previous match ended; the loop therefore iterates over every match (zero or more) on each line of the Java file.
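
    Here is a small sketch of that matching behaviour on a single line. The two-keyword regex and the sample line are invented for the example, and the matches are collected in @matched instead of in the hash.

```perl
use strict;
use warnings;

# Hypothetical keyword alternation and sample line.
my $key_regex = 'int|for';
my $line      = 'for (int i = 0; i < n; i++) print(format, i);';

my @matched;

# In scalar context, each /g match resumes where the previous one ended,
# so the loop body runs once per match.
while ( $line =~ /\b($key_regex)\b/g ) {
    push @matched, $1;
}

print "matched: @matched\n";
```

    The "int" inside "print" and the "for" inside "format" are not counted, because neither has a word boundary on both sides.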

    You can learn more about the Hash of Arrays (HoA) and other data structures by running perldoc on the perldsc man page (data structures cookbook); perlreftut can also be helpful; perlre will explain about \b, etc.

    (update: the proposed solution assumes that the list of keywords is made up entirely of alphabetic (or alphanumeric) words; any non-alphanumeric characters in the keyword list are likely to muck things up, in particular: characters with special meanings in a regex, such as  [.+@#$^*(){}\] -- there are ways around this, but we don't need to go into that yet, I think.)
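
    For when you do get to that point, one possible way around it (not part of the solution above, just a sketch with made-up keywords) is Perl's built-in quotemeta, which backslash-escapes the non-word characters in a string:

```perl
use strict;
use warnings;

# Hypothetical keywords, one of which contains regex metacharacters.
my @keywords = ( 'i++', 'while' );

# quotemeta puts a backslash in front of every non-word character.
my $key_regex = join '|', map { quotemeta($_) } @keywords;

my $line = 'while (x) { i++; }';
my @matched;

# \b is left out here: it only matches next to a word character, so it
# would not behave as intended around a keyword such as "i++".
while ( $line =~ /($key_regex)/g ) {
    push @matched, $1;
}

print "regex:   $key_regex\n";
print "matched: @matched\n";
```

    Dropping "\b" means whole-word matching is lost, so this is a trade-off rather than a drop-in replacement.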