stort83 has asked for the wisdom of the Perl Monks concerning the following question:

Right, here's the problem:

i have one file (input), which contains keywords to be matched and solution.java is the file i want to check through for the keywords.

so far i have written a program that will look at the first line (ie keyword) in input and loop through solution.java to see if it exists. the second line in input (ie 2nd keyword) is then used to search through every line of solution.java to see if it exists. this continues for all lines in input.

so my program will print out "yes the", $keyword, " exists"; for every line it is found in solution.java.

this is great but a counter would be far more useful. if i know how many lines there are in input it is easy to create $counter1, $counter2 etc for each line. the problem comes as the number of lines of input can vary. So i need a way to create a separate counter for each line in input.

i hope i haven't lost anybody and any help would be gratefully received

below is my current loop, HTH
#####################
# lines is solution.java
# lines2 is input
for ($a=0; $a<$#lines+1; $a++){
    for ($b=0; $b<$#lines2+1; $b++){
        if ($lines[$a] =~ /$lines2[$b]/) {
            print $lines[$a];
            print "\nyes ", $lines2[$b], " exists\n\n";
        }
    }
}
#####################
cheers Stort

2006-02-21 Retitled by g0n, as per Monastery guidelines
Original title: 'arrays and variables'

Replies are listed 'Best First'.
Re: Improving a script to search for keywords in a file
by brian_d_foy (Abbot) on Feb 19, 2006 at 19:09 UTC

    How big is this file and how many keywords will you look for? This seems like you are going to take up a lot of time. Not only that, but simply finding a substring in the line doesn't mean that it was used as a java keyword (although you may have some other definition). For instance, you might not want to count strings that appear as comments.

    Why not use a Java source code parser, then walk the tree pulling out the keywords and their line numbers? It might not be in Perl (there seem to be many Java implementations), but at least it has a chance of being right. :)

    For the general situation that you're using so far, you just need to add a counter hash. You could simply count:

    if( $line =~ /$keyword/ ) { $Counter{ $keyword } ++; ...
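As a minimal sketch of that counter hash, assuming the keywords and the source lines have already been read into arrays (the sample data, the sub name count_keyword_lines, and the \Q...\E quoting are illustrative additions, not part of the original post):

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the two files.
my @keywords = ('class', 'int', 'return');
my @lines = (
    "public class Solution {\n",
    "    int add(int a, int b) {\n",
    "        return a + b;\n",
    "    }\n",
    "}\n",
);

# Count the lines on which each keyword appears as a substring.
sub count_keyword_lines {
    my ($keywords, $lines) = @_;
    my %Counter = map { $_ => 0 } @$keywords;   # one counter per keyword
    for my $line (@$lines) {
        for my $keyword (@$keywords) {
            # \Q...\E makes the keyword match literally (an assumption here)
            $Counter{$keyword}++ if $line =~ /\Q$keyword\E/;
        }
    }
    return \%Counter;
}

my $Counter = count_keyword_lines(\@keywords, \@lines);
print "$_: $Counter->{$_}\n" for sort keys %$Counter;
```

Because the hash is keyed by the keyword itself, it grows to whatever size the keyword list happens to be, so no $counter1, $counter2 and so on are needed.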

    Or you could do something more fancy, such as remembering the line numbers.

    # Initialize the counter
    my %Counter = map { $_, [] } @keywords;

    while( my $line = <INPUT> ) # one line at a time
        {
        ...
        if( $line =~ /$keyword/ )
            {
            # $. is the current input line number
            push @{ $Counter{ $keyword } }, $.;
            ...
            }

    Good luck :)

    --
    brian d foy <brian@stonehenge.com>
    Subscribe to The Perl Review
Re: Improving a script to search for keywords in a file
by GrandFather (Saint) on Feb 19, 2006 at 20:41 UTC

    brian_d_foy gave you pretty much the answer you need, but he didn't mention some of the foibles in your code that make it clear Perl is not your first language. :)

    Perl is pretty good at handling lists of things, to the extent that there is a special version of the for loop to iterate over lists. Your nested loops could be rewritten as:

    for my $line (@lines){
        for my $line2 (@lines2){
            next if $line !~ /$line2/;
            print $line;
            print "\nyes $line2 exists\n\n";
        }
    }

    One side effect of that is to remove $a and $b - which actually are special variables in Perl because they are used as the two arguments passed into a sort block.
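To illustrate why $a and $b are best left alone, here is a small sketch of a sort block using them (the @hits data is made up for the example):

```perl
use strict;
use warnings;

my @hits = (12, 3, 47, 8);   # hypothetical line numbers

# sort sets the package variables $a and $b to the pair being compared,
# so they need no "my" declaration even under "use strict".
my @ascending = sort { $a <=> $b } @hits;

print "@ascending\n";
```

Using $a or $b as ordinary loop counters elsewhere in the same package risks confusing results once a sort block enters the picture.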


    DWIM is Perl's answer to Gödel
      guilty as charged, perl is not my first programming language, and i am learning as i go along.
      i am a bit confused as to what the below code is doing, where does <input> come from?
      what do the "..." represent?
      and how do you display the results?
      thanks very much
      # Initialize the counter
      my %Counter = map { $_, [] } @keywords;

      while( my $line = <INPUT> ) # one line at a time
          {
          ...
          if( $line =~ /$keyword/ )
              {
              # $. is the current input line number
              push @{ $Counter{ $keyword } }, $.;
              ...
              }

        No need to feel guilty! You have come to the right place to learn though.

        <INPUT> reads a line from the INPUT file handle. Generally you use open to create a file handle:

        open INPUT, '<', 'theFileName';
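A slightly fuller sketch of opening and reading a handle, with an error check added. It assumes a lexical handle ($in) rather than the bareword INPUT, and reads from an in-memory string so that it runs without any file on disk; both choices are illustrative only:

```perl
use strict;
use warnings;

# An in-memory string stands in for the real file so the sketch is
# self-contained; for a file on disk, pass the file name instead of \$text.
my $text = "first line\nsecond line\n";

open my $in, '<', \$text or die "Cannot open input: $!";

my @lines;
while ( my $line = <$in> ) {
    chomp $line;                 # strip the trailing newline
    push @lines, "$.: $line";    # $. is the current input line number
}
close $in;

print "$_\n" for @lines;
```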

        The first ... is just a place holder for lines of code that you might insert at that point. In particular you are likely to chomp $line to remove the line end sequence from $line.

        The second ellipsis is a place holder for any lines of code that you might use to do further processing for the current line. Neither affects the important bit which is the hash access push @{ $Counter{ $keyword } }, $.;. That line pushes the current input line number onto the end of an array which is held onto by the counter hash (%Counter). You can later list all the lines that contained each key word with something like:

        print "$_ found in lines: @{$Counter{$_}}\n" for sort keys %Counter;

        Note that $_ is the temporary variable and is an alias to each of the keyword values in %Counter. @{$Counter{$_}} causes each of the line numbers in the array held on to by the hash entry for the key word to be printed out with a space between them.

        It's not expected that your code will be quite this succinct on day 1. :) However, if you stick around here for a while and read a few of the other questions and answers, it'll start making sense pretty quick. Good luck, and enjoy Perl.


        DWIM is Perl's answer to Gödel
        To expand a little more...

        If you don't know what "map" does, look it up with "perldoc -f map". (If you haven't learned about "perldoc" yet, run the command line "perldoc perldoc" to read about it, then get accustomed to using it every day. I still do.)

        Basically, you would use your list of keywords to initialize a hash of arrays (%Counter) -- the keywords themselves are the hash keys, and each hash value will be an array of line numbers in your java file where the given keyword occurs.
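A short sketch of that hash of arrays in isolation, assuming a made-up keyword list and made-up line numbers:

```perl
use strict;
use warnings;

my @keywords = ('class', 'int');   # hypothetical keyword list

# map returns one key/value pair per keyword: keyword => new empty array ref
my %Counter = map { $_ => [] } @keywords;

# Record some made-up line numbers
push @{ $Counter{int} },   2, 7;
push @{ $Counter{class} }, 1;

for my $key ( sort keys %Counter ) {
    my $count = scalar @{ $Counter{$key} };   # array in scalar context = length
    print "$key: $count hit(s) on lines @{ $Counter{$key} }\n";
}
```

Since each value is an array of line numbers, the number of hits for a keyword is simply the length of its array, so a separate counter is not needed.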

        Here's one way to complete the app, with an example of how to output the results -- while I'm at it, I'll suggest using a single regex that concatenates all the possible keywords, to make the nested loop a little more efficient:

        use strict;

        my %Counter;

        open( K, "input" ) or die "input (keyword file): $!";
        while ( <K> ) {
            chomp;
            $Counter{$_} = [];  # initialize to (a reference to) an empty array
        }
        close K;

        # concatenate the keywords with "|" to form a regular expression:
        my $key_regex = join( '|', keys %Counter );

        open( J, "solution.java" ) or die "solution.java: $!";
        while ( <J> ) {
            while ( /\b($key_regex)\b/g ) {
                push @{$Counter{$1}}, $.;
            }
        }
        close J;

        # now print a list of keyword hits:
        for my $key ( sort keys %Counter ) {
            next unless scalar @{$Counter{$key}};
            print "$key found on lines @{$Counter{$key}}\n";
        }

        Note the use of "\b" and parens around "$key_regex" -- this makes sure that we match whole keywords, surrounded by word boundaries, and captures whatever keyword was matched into $1. With the "g" modifier, the regex match in the inner while loop's condition picks up where the previous match left off, so the loop iterates over every match (zero or more) on each line from the java file.
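To see the effect of "\b" and the "g" modifier on a single line, here is a small sketch with a hypothetical pair of keywords and a made-up line of Java:

```perl
use strict;
use warnings;

my $key_regex = join '|', ('int', 'return');   # hypothetical keywords
my $line      = 'int printed = 1; return (int) printed;';

my @found;
# In scalar context each evaluation of m//g finds the next match,
# so the loop body runs once per whole-word hit.
while ( $line =~ /\b($key_regex)\b/g ) {
    push @found, $1;
}

print "@found\n";
```

The "int" buried inside "printed" is skipped because there is no word boundary on either side of it.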

        You can learn more about the Hash of Arrays (HoA) and other data structures by running perldoc on the perldsc man page (data structures cookbook); perlreftut can also be helpful; perlre will explain about \b, etc.

        (update: the proposed solution assumes that the list of keywords is made up entirely of alphabetic (or alphanumeric) words; any non-alphanumeric characters in the keyword list are likely to muck things up, in particular: characters with special meanings in a regex, such as  [.+@#$^*(){}\] -- there are ways around this, but we don't need to go into that yet, I think.)
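One possible way around the special-character problem, offered as a sketch rather than as the approach the post had in mind, is to escape each keyword with quotemeta before joining. The keyword list and sample line below are made up, and "\b" is left out because it only behaves as expected next to word characters:

```perl
use strict;
use warnings;

my @keywords = ('i++', 'a.b', 'int');   # hypothetical keyword list

# quotemeta backslash-escapes every non-word character, so "+" and "."
# are matched literally instead of acting as regex operators.
my $key_regex = join '|', map { quotemeta($_) } @keywords;

my $line = 'int axb = i++;';
my @found;
while ( $line =~ /($key_regex)/g ) {
    push @found, $1;
}

print "@found\n";
```

Without the escaping, "a.b" would also match "axb", and "i++" would not compile as a regex at all.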