Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Wise Monks,

While reading Learning Perl, I stumbled across exercise #2 in Chapter 9. The answer goes like this:

@ARGV = '/path/to/perlfunc.pod';
while (<>) {
    if (/^=item\s+([a-z_]\w*)/i) {
        print "$1\n";
    }
}
This is supposed to find every line that starts with =item followed by a Perl identifier name, and print that identifier. Using this as a model, I wrote something that should do something similar.
sub findtext {
    @filenumbers = @_;
    foreach $number (@filenumbers) {
        push @filenumbers2, split(/\W/, $number);
    }
    foreach $number (@numbers2) {
        chomp $number;
        if ( defined $number ) {
            open(FILE, "/home/jroberts/$number.txt") or die "$!";
            foreach $term (@inputs) {
                while (<FILE>) {
                    if (/\b($term)\b/i) {
                        push @before, split(' ', $`);
                        @before = reverse(@before);
                        @before = splice(@before, 0, 7);
                        @before = reverse(@before);
                        push @after, split(' ', $');
                        @after = splice(@after, 0, 7);
                        if (exists $results{$number}) {
                            $existing = $results{$number};
                            $results{$number} = $existing . "... @before" . "<b>$&</b>" . "@after ...";
                        } else {
                            $results{$number} = "... @before" . "<b>$&</b>" . "@after ";
                        }
                        @before = undef;
                        @after  = undef;
                        next;
                    } else {
                        print "No Match\n";
                        next;
                    }
                }
            }
            print "Match found in $number.txt\n";
            @fulltext = $results{$number};
            print "@fulltext\n";
            close(FILE);
        } else {
            next;
        }
        next;
    }
}
When I put terms into @inputs using @ARGV, this only prints the first match it finds, even though there are more, and any additional $term values don't get matched at all. I know this can be done using grep, but I wanted to write it in Perl.

Replies are listed 'Best First'.
Re: Pattern Matching With Regular Expressions
by Paladin (Vicar) on Apr 13, 2004 at 03:37 UTC
    You open your file just inside the if, then read the entire file in the while loop. Then the next time through the foreach $term (@inputs) loop, there is nothing left to read, so it exits the loop immediately.

    You have a few options:

    1. Reverse the while and foreach loops, so that you loop through @inputs for each line read; or
    2. Rewind the file after the while loop; or
    3. If the file is small enough, read the entire thing into memory and use a second foreach loop instead of the while (there is a sketch of this below).

    I don't know if it matters to you which order you get your matches, nor if the files are small enough, so I'm not sure which option would be better in this case.
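
    As a rough sketch of option 3 (the file path and the @inputs contents here are made up):

        use strict;
        use warnings;

        my @inputs = ('cat', 'dog');    # hypothetical search terms

        open my $fh, '<', '/home/jroberts/1.txt' or die "open failed: $!";   # made-up path
        my @lines = <$fh>;              # slurp the whole file into memory once
        close $fh;

        foreach my $term (@inputs) {
            foreach my $line (@lines) {  # a second foreach replaces the while (<FILE>)
                print $line if $line =~ /\b\Q$term\E\b/i;
            }
        }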

    Update: You might also want to reformat your code, for example by running it through perltidy, to make it easier to see where each block ends.

      What exactly do you mean by rewind? And by reverse, do you mean switch places?
      jroberts
        By rewind he means seek FILE, 0, 0, so that the next read is at the beginning of the file.

        And by reverse the loops, he means change:

        foreach $term (@inputs) {
            while (<FILE>) {
        to
        while (<FILE>) {
            foreach $term (@inputs) {
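
        As a minimal sketch of that swapped arrangement (the path and terms are made up):

            my @inputs = ('cat', 'dog');                                  # hypothetical search terms
            open(FILE, "/home/jroberts/1.txt") or die "open failed: $!";  # made-up path

            while (my $line = <FILE>) {          # read each line of the file exactly once
                foreach my $term (@inputs) {     # test every term against that line
                    print "line $. matches '$term'\n" if $line =~ /\b\Q$term\E\b/i;
                }
            }
            close(FILE);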

        I'm going to add a few unrelated points:
        Always:

        use warnings;
        use strict;
        In fact, you probably have a bug of "filenumbers2" vs. "numbers2" that use strict would catch.
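
        For example, under strict that misspelling becomes a compile-time error instead of a silently empty array (a tiny made-up script):

            use strict;
            use warnings;

            my @filenumbers2 = (1, 2, 3);
            print "@numbers2\n";    # dies at compile time:
                                    # Global symbol "@numbers2" requires explicit package name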

        Instead of

        push @before, split(' ', $`);
        you should have
        @before = split(' ', $`);
        Then you don't need the awkward @before = undef; (which, incidentally, leaves @before holding a single undef element rather than emptying it). The same holds for @after, too.

        Really it should be

        my @before = split /\s+/, $`;
        The three lines
        @before = reverse(@before);
        @before = splice(@before, 0, 7);
        @before = reverse(@before);
        are better written as splice(@before, 0, -7);. And this code:
        if (exists $results{$number}) {
            $existing = $results{$number};
            $results{$number} = $existing . "... @before" . "<b>$&</b>" . "@after ...";
        } else {
            $results{$number} = "... @before" . "<b>$&</b>" . "@after ";
        }
        can be written as
        $results{$number} .= "... @before" . "<b>$&</b>" . "@after ";
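
        (Going back to the splice suggestion above, a quick sketch of what the negative length does:)

            my @before = (1 .. 10);
            splice(@before, 0, -7);    # remove leading elements, keeping only the last 7
            print "@before\n";         # prints: 4 5 6 7 8 9 10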
        Sometimes we forget that perl doesn't actually have a function called "rewind" -- but it does have a function called seek, which achieves the same result. The issue is that a normal file handle maintains a "pointer" to the next position to be read (i.e. the end-point of the data that was last read). When you get to the end of the file, this pointer is at EOF, which means nothing else can be read.

        The "seek" function (perldoc -f seek) can reposition the pointer to any position in the file; a typical place to go is back to the beginning:

        seek FILE, 0, 0; # position file pointer at start of file
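
        And as a rough sketch of option 2 from earlier (this assumes the FILE handle is already open and @inputs holds the search terms):

            foreach my $term (@inputs) {
                seek FILE, 0, 0;                 # rewind so the file is re-read from the top for each term
                while (my $line = <FILE>) {
                    print $line if $line =~ /\b\Q$term\E\b/i;
                }
            }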
Re: Pattern Matching With Regular Expressions
by graff (Chancellor) on Apr 13, 2004 at 04:08 UTC
    To begin with, you'll learn quicker if you start with "use strict" -- because this will tell you when you make silly mistakes like misspelling your variable names. I noticed that you create an array called "@filenumbers2" in the first foreach loop, and in the next loop right after that, you refer to an array called "@numbers2" -- "use strict" would catch that.

    Next, you have a third foreach loop (nested within the second one), where you try to read the full content of FILE for each "$term" in "@inputs" -- but after reading the file for the first "$term", you don't "rewind" the file, which means there's no more data to be read until you close that file and open another one. That's why you're not getting as many matches as you expect.

    To address the latter problem, rethink your logic -- in general, reading from a file is expensive, compared to looping over the elements of an array. So read through the file once, and for each line you read from the file, loop over the elements in @inputs to look for matches. (There are ways to combine a list of patterns into a single regex, by simply joining them together with "|", but we don't need to go there.)
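
    For what it's worth, a minimal sketch of that join-with-"|" idea (the @inputs contents are made up, and quotemeta escapes any regex metacharacters in the terms):

        use strict;
        use warnings;

        my @inputs  = ('cat', 'dog');                       # hypothetical search terms
        my $pattern = join '|', map { quotemeta } @inputs;  # one alternation covering every term

        while (my $line = <>) {                             # reads the files named on the command line
            print "line $. matches '$1'\n" if $line =~ /\b($pattern)\b/i;
        }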

    Another thing that will help is to use a text editor that makes it easy to do consistent indenting, and make your indentation consistent, to reflect looping and conditions.

    There's a lot more that could be done to make the code easier to read, less bulky, and generally better. Here's one way to start:

    sub findtext {
        my ($fileargs, $inputs) = @_;    # pass references to arrays
        my @filenames;
        for my $arg ( @$fileargs ) {     # dereference this array
            push @filenames, grep /\w/, split( /\W+/, $arg );
        }
        for my $file ( @filenames ) {
            unless ( open( FILE, "/home/jroberts/$file.txt" ) ) {
                warn "open failed on $file: $!";
                next;
            }
            while (<FILE>) {
                for my $term ( @$inputs ) {    # dereference this array
                    next unless ( /\b$term\b/ );
                    # get here when there's a match...
                    # (not sure what you want to do here)
                }
            }
        }
    }
    A lot of your stuff with @before and @after is probably more complicated than it needs to be, but I didn't look at that part so closely... Maybe if you could describe in English (and/or with basic examples) what you're trying to accomplish, you'll figure out an easier way (and maybe the monks can help with that).

    I don't really know where your @inputs is coming from, but I'd suggest that you pass it in as an array reference, to keep the subroutine "modular" (i.e. not dependent on a surrounding context of global variables -- this can be another good side-effect of "use strict"). Regarding this array, do be careful about metacharacters in the array elements -- things like ".&+@%^$" and brackets contained within $term will have their magical regex significance unless you put "\Q" and "\E" around the variable when doing the match.
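
    A small made-up example of why \Q...\E matters:

        my $term = 'perl.org';                 # the "." is a regex metacharacter
        my $text = 'see perlXorg for docs';    # note: not actually "perl.org"

        print "unescaped matches\n" if $text =~ /\b$term\b/;      # matches -- the "." matches the X
        print "escaped matches\n"   if $text =~ /\b\Q$term\E\b/;  # no match -- the "." is literal now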

    Update: I get it now -- you're building a concordancer that will produce a listing of "key words in context" (KWIC). This is a great exercise for honing your perl skills (even though there are numerous open-source and free-ware packages available on the web to do this already -- do a google search for "KWIC"). One suggestion: don't limit your context to individual lines of text -- line breaks are an arbitrary disruption of linguistic content, and it's better to just ignore them. Here's one way, assuming that your .txt files really are just plain text (without any markup or other noise):

    my %input;
    $input{$_} = undef for @$inputs;    # make this a hash
    $/ = undef;                         # look up $/ in perldoc perlvar
    unless ( open( FILE, ...)) {
        # blah
        next;
    }
    $text = <FILE>;                     # $text holds entire file content.
    $text =~ s/\n\n/ <P> /g;            # (optional: preserve paragraph boundaries)
    @words = split( /\s+/, $text );
    for my $i (0..$#words) {
        next unless ( exists( $input{$words[$i]} ));
        # get here when there's a match, output matched
        # word along with "N" words of surrounding context
        # (left as an exercise...)
    }
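
    (One possible way to fill in the part left as an exercise, dropped in where that last comment sits -- the window size is a made-up choice, and the slices simply clamp to the ends of @words:)

        my $window = 7;                           # hypothetical: words of context on each side
        my $lo = $i - $window;  $lo = 0       if $lo < 0;
        my $hi = $i + $window;  $hi = $#words if $hi > $#words;
        print "... @words[$lo .. $i-1] <b>$words[$i]</b> @words[$i+1 .. $hi] ...\n";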
      Why do they have to be referenced? They aren't changing, are they? Now it looks like this:
      sub findtext {
          @filenumbers = @_;
          foreach $number (@filenumbers) {
              push @filenumbers2, split(/\W/, $number);
          }
          foreach $number (@filenumbers2) {
              chomp $number;
              if ( defined $number ) {
                  open(FILE, "/home/jroberts/$number.txt") or die "$!";
                  while (<FILE>) {
                      for $term (@inputs) {
                          next unless (/\b($term)\b/i);
                          push @before, split(' ', $`);
                          @before = reverse(@before);
                          @before = splice(@before, 0, 7);
                          @before = reverse(@before);
                          push @after, split(' ', $');
                          @after = splice(@after, 0, 7);
                          if (exists $results{$number}) {
                              $existing = $results{$number};
                              $results{$number} = $existing . "... @before" . "<b>$1</b>" . "@after ...";
                          } else {
                              $results{$number} = "$url... @before" . "<b>$1</b>" . "@after ";
                          }
                          @before = undef;
                          @after  = undef;
                          next;
                      }
                  }
                  close(FILE);
                  print "Match found in $number.txt\n";
                  @fulltext = $results{$number};
                  print "@fulltext\n";
              } else {
                  next;
              }
              next;
          }
      }
      This still returns only one match per $term. I still don't understand how to "rewind" the file. @before and @after just put text around the matches, kinda like a search engine format.
      jroberts
        At this point (with the code as shown just now), you're only getting one match because every time you find a match, you completely reset the value of "$results{$number}" -- you would want something like this whenever a match is found:
        $results{$number} .= "$url... @before <B>$1</B> @after <BR/>";
        Note the concatenation-assignment operator ".=", which appends to the existing value instead of replacing it.
        Sorry -- I wasn't paying close-enough attention. If you're saying that a given $term might occur more than once on a given line, and you're only getting the first occurrence, not both, yeah, that makes sense. You do a next after processing the first match of $term on each line. The logic I suggested in my update about doing KWIC searches will fix this. Otherwise, you have to do something like:
        for $term ( @$inputs ) {
            while ( /\b$term\b/g ) {
                ...
            }
        }
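
        A tiny made-up example of how that /g loop walks through every occurrence on a line:

            my $line = 'the cat sat on the cat mat';
            my $term = 'cat';

            while ( $line =~ /\b\Q$term\E\b/g ) {    # /g in scalar context resumes where the last match ended
                print "found '$&' ending at offset ", pos($line), "\n";    # fires twice for this line
            }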

        BTW, please note the update I made in my earlier reply, about doing KWIC. I think other replies in this thread have explained about seeking to the beginning of the file, which is now a moot point.

        Another update, to answer your question about references: you're right, the input args are not being changed, but I'm suggesting that you pass two arrays to the sub: one is a list of files to search in, and the other is a list of terms to search for; using references to arrays allows you to pass both of these in one sub call -- if you don't use array refs, you're just passing an undifferentiated list, and the sub has no way of knowing where one array ends and the other begins. (whew! sorry about the mess!)
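
        A minimal sketch of that calling convention (the data is made up):

            use strict;
            use warnings;

            my @files = ('1', '2');            # hypothetical file numbers
            my @terms = ('cat', 'dog');        # hypothetical search terms

            findtext( \@files, \@terms );      # two distinct arrays arrive as two references

            sub findtext {
                my ($fileargs, $inputs) = @_;  # unpack the two array references
                print "files: @$fileargs\n";
                print "terms: @$inputs\n";
            }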