Improving a script to search for keywords in a file

stort83 has asked for the wisdom of the Perl Monks concerning the following question:

Re: Improving a script to search for keywords in a file by brian_d_foy (Abbot) on Feb 19, 2006 at 19:09 UTC
How big is this file and how many keywords will you look for? This seems like it is going to take up a lot of time. Not only that, but simply finding a substring in the line doesn't mean that it was used as a Java keyword (although you may have some other definition). For instance, you might not want to count strings that appear in comments. Why not use a Java source code parser, then walk the tree pulling out the keywords and their line numbers? It might not be in Perl (there seem to be many Java implementations), but at least it has a chance of being right. :)

For the general approach that you're using so far, you just need to add a counter hash. You could simply count:

    if( $line =~ /$keyword/ ) {
        $Counter{ $keyword }++;
        ...

Or you could do something fancier, such as remembering the line numbers.

    # Initialize the counter
    my %Counter = map { $_, [] } @keywords;

    while( my $line = <INPUT> )  # one line at a time
    {
        ...
        if( $line =~ /$keyword/ ) {
            # $. is the current input line number
            push @{ $Counter{ $keyword } }, $.;
            ...
    }

Good luck :)

-- brian d foy <brian@stonehenge.com>
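To make the line-number idea concrete, here is a minimal, self-contained sketch. The keyword list and the sample Java text are made up for illustration, and an in-memory filehandle stands in for the real source file. It uses the same plain substring match as the snippets above, so it shares their false positives.

```perl
use strict;
use warnings;

# Hypothetical keyword list and sample input, for illustration only.
my @keywords = qw(class int return);
my $source = <<'END_JAVA';
public class Demo {
    int total = 0;
    // print the total
    return;
}
END_JAVA

# One (initially empty) array of line numbers per keyword.
my %Counter = map { $_ => [] } @keywords;

# Open a read handle on the string so the sketch needs no external file.
open my $in, '<', \$source or die "Cannot open in-memory file: $!";
while ( my $line = <$in> ) {
    for my $keyword (@keywords) {
        # $. is the current input line number
        push @{ $Counter{$keyword} }, $. if $line =~ /$keyword/;
    }
}
close $in;

for my $keyword ( sort keys %Counter ) {
    print "$keyword: @{ $Counter{$keyword} }\n";
}
```

In this sample, `int` is reported on line 3 as well as line 2, because "print" in the comment contains the substring "int". That is exactly the kind of miscount a real parser would avoid.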
Re: Improving a script to search for keywords in a file by GrandFather (Saint) on Feb 19, 2006 at 20:41 UTC
brian_d_foy gave you pretty much the answer you need, but he didn't mention some of the foibles in your code that make it clear Perl is not your first language. :)

Perl is pretty good at handling lists of things, to the extent that there is a special version of the for loop to iterate over lists. Your nested loops could be rewritten as:

    for my $line (@lines) {
        for my $line2 (@lines2) {
            next if $line !~ /$line2/;
            print $line;
            print "\nyes $line2 exists\n\n";
        }
    }

One side effect of that is to remove $a and $b, which are special variables in Perl because they hold the two items being compared inside a `sort` block.

DWIM is Perl's answer to Gödel
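For readers wondering why `$a` and `$b` deserve the warning: `sort` sets those two package variables to the pair of elements being compared. A minimal sketch, with made-up data:

```perl
use strict;
use warnings;

# Hypothetical line numbers, for illustration only.
my @line_numbers = ( 12, 3, 101, 7 );

# sort places each pair of elements in $a and $b before running
# the block; <=> compares them numerically.
my @ascending = sort { $a <=> $b } @line_numbers;

print "@ascending\n";
```

Because `sort` relies on them, `use strict` does not require `$a` and `$b` to be declared, and declaring your own `my $a` in a scope that also contains a sort block tends to cause confusing errors. Using other names for ordinary variables sidesteps the problem.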
Re^2: Improving a script to search for keywords in a file by Anonymous Monk on Feb 19, 2006 at 23:18 UTC
Guilty as charged: Perl is not my first programming language, and I am learning as I go along. I am a bit confused as to what the code below is doing. Where does `<INPUT>` come from? What do the "..." represent? And how do you display the results? Thanks very much.

    # Initialize the counter
    my %Counter = map { $_, [] } @keywords;

    while( my $line = <INPUT> )  # one line at a time
    {
        ...
        if( $line =~ /$keyword/ ) {
            # $. is the current input line number
            push @{ $Counter{ $keyword } }, $.;
            ...
    }
Re^3: Improving a script to search for keywords in a file by GrandFather (Saint) on Feb 20, 2006 at 01:09 UTC
No need to feel guilty! You have come to the right place to learn though.

`<INPUT>` reads a line from the INPUT file handle. Generally you use open to create a file handle:

    open INPUT, '<', 'theFileName' or die "Cannot open theFileName: $!";

The first `...` is just a placeholder for lines of code that you might insert at that point. In particular, you are likely to `chomp $line` to remove the line-end sequence from $line. The second ellipsis is a placeholder for any lines of code that you might use to do further processing for the current line. Neither affects the important bit, which is the hash access `push @{ $Counter{ $keyword } }, $.;`. That line pushes the current input line number onto the end of an array which is held by the counter hash (`%Counter`).

You can later list all the lines that contained each keyword with something like:

    print "$_ found in lines: @{$Counter{$_}}\n" for sort keys %Counter;

Note that `$_` is the loop's default variable and takes each keyword (each key of %Counter) in turn. `@{$Counter{$_}}` causes each of the line numbers in the array held by the hash entry for that keyword to be printed out with a space between them.

It's not expected that your code will be quite this succinct on day 1. :) However, if you stick around here for a while and read a few of the other questions and answers, it'll start making sense pretty quickly.

Good luck, and enjoy Perl.

DWIM is Perl's answer to Gödel
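The dereferencing step can be seen in isolation with a hand-built hash of arrays. The keywords and line numbers below are invented for illustration; in the real script they would come from the matching loop.

```perl
use strict;
use warnings;

# Hypothetical results: keyword => reference to an array of line numbers.
my %Counter = (
    'class' => [1],
    'int'   => [ 2, 3 ],
    'while' => [],
);

my @report;
for my $keyword ( sort keys %Counter ) {
    # $Counter{$keyword} is an array reference; @{ ... } dereferences it.
    my @lines = @{ $Counter{$keyword} };
    next unless @lines;    # skip keywords that never matched
    push @report, "$keyword found in lines: " . join( ', ', @lines );
}

print "$_\n" for @report;
```

Copying the array into `@lines` is not required; it just makes the dereference visible as its own step. The one-line version interpolates `@{$Counter{$_}}` directly into the string instead.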
Re^3: Improving a script to search for keywords in a file by graff (Chancellor) on Feb 20, 2006 at 01:42 UTC
To expand a little more... If you don't know what "map" does, look it up with "perldoc -f map". (If you haven't learned about "perldoc" yet, run the command line "perldoc perldoc" to read about it, then get accustomed to using it every day. I still do.)

Basically, you would use your list of keywords to initialize a hash of arrays (%Counter) -- the keywords themselves are the hash keys, and each hash value will be an array of line numbers in your Java file where the given keyword occurs.

Here's one way to complete the app, with an example of how to output the results -- while I'm at it, I'll suggest using a single regex that combines all the possible keywords, to make the nested loop a little more efficient:

    use strict;

    my %Counter;
    open( K, "input" ) or die "input (keyword file): $!";
    while ( <K> ) {
        chomp;
        $Counter{$_} = [];  # initialize to (a reference to) an empty array
    }
    close K;

    # join the keywords with "|" to form a regular expression:
    my $key_regex = join( '|', keys %Counter );

    open( J, "solution.java" ) or die "solution.java: $!";
    while ( <J> ) {
        while ( /\b($key_regex)\b/g ) {
            push @{$Counter{$1}}, $.;
        }
    }
    close J;

    # now print a list of keyword hits:
    for my $key ( sort keys %Counter ) {
        next unless scalar @{$Counter{$key}};
        print "$key found on lines @{$Counter{$key}}\n";
    }

Note the use of "\b" and parens around "$key_regex" -- this makes sure that we match whole keywords, surrounded by word boundaries, and captures whatever keyword was matched into $1. The "g" modifier on the regex match will find zero or more matches on every line from the Java file, and the inner while loop (having the regex match as its condition) will iterate over every match.

You can learn more about the Hash of Arrays (HoA) and other data structures by running perldoc on the perldsc man page (data structures cookbook); perlreftut can also be helpful; perlre will explain about \b, etc.

(update: the proposed solution assumes that the list of keywords is made up entirely of alphabetic (or alphanumeric) words; any non-alphanumeric characters in the keyword list are likely to muck things up, in particular: characters with special meanings in a regex, such as `[.+@#$^*(){}\]` -- there are ways around this, but we don't need to go into that yet, I think.)
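One of the ways around the metacharacter problem is Perl's built-in `quotemeta`, which backslash-escapes regex metacharacters before the keywords are joined. A minimal sketch, with a made-up keyword list that includes `++` and a made-up input line:

```perl
use strict;
use warnings;

# Hypothetical keyword list; "++" would break an unescaped regex.
my @keywords = ( 'int', '++', 'return' );

# Escape each keyword, then join with | to form the alternation.
my $alternation = join '|', map { quotemeta($_) } @keywords;
my $key_regex   = qr/($alternation)/;

# Hypothetical line of Java, for illustration only.
my $line = 'int i = 0; i++; return i;';

# In list context, /g returns every captured match on the line.
my @hits = $line =~ /$key_regex/g;

print "@hits\n";
```

This sketch drops the `\b` anchors, because a word boundary does not behave usefully next to a keyword made entirely of punctuation such as `++`. The cost is that `int` would again match inside a longer word such as "print", so a real script would probably need to treat word-like and symbol-like keywords separately.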