Re: Pattern Matching With Regular Expressions

To begin with, you'll learn quicker if you start with "use strict" -- because this will tell you when you make silly mistakes like misspelling your variable names. I noticed that you create an array called "@filenumbers2" in the first foreach loop, and in the next loop right after that, you refer to an array called "@numbers2" -- "use strict" would catch that.

Next, you have a third foreach loop (nested within the second one), where you try to read the full content of FILE for each "$term" in "@inputs" -- but after the reading the file for the first "$term", you don't "rewind" the file, which means there's no more data to be read until you close that file and open another one. That's why you're not getting as many matches as you expect.

To address the latter problem, rethink your logic -- in general, reading from a file is expensive, compared to looping over the elements of an array. So read through the file once, and for each line you read from the file, loop over the elements in @inputs to look for matches. (There are ways to combine a list of patterns into a single regex, by simply joining them together with "|", but we don't need to go there.)

Another thing that will help is to use a text editor that makes it easy to do consistent indenting, and make your indentation consistent, to reflect looping and conditions.

There's a lot more that could be done to make the code easier to read, less bulky, and generally better. Here's a one way to start:

sub findtext
{
    my ($fileargs, $inputs) = @_; # pass references to arrays
    my @filenames;
    for my $arg ( @$fileargs ) {  # dereference this array
        push @filenames, grep /\w/, split( /\W+/, $arg );
    }
    for my $file ( @filenames ) {
        unless open( FILE, "/home/jroberts/$file.txt" ) {
            warn "open failed on $file: $!";
            next;
        }
        while (<FILE>) {
            for my $term ( @$inputs ) { # dereference this array
                next unless ( /\b$term\b/ );
                # get here when there's a match...
                # (not sure what you want to do here)
            }
        }
    }
}
[download]

A lot of your stuff with @before and @after is probably more complicated than it needs to be, but I didn't look at that part so closely... Maybe if you could describe in English (and/or with basic examples) what you're trying to accomplish, you'll figure out an easier way (and maybe the monks can help with that).

I don't really know where your @inputs is coming from, but I'd suggest that you pass it in as an array reference, to keep the subroutine "modular" (i.e. not dependent on a surrounding context of global variables -- this can be another good side-effect of "use strict"). Regarding this array, do be careful about metacharacters in the array elements -- things like ".&+@%^$" and brackets contained within ~~$item~~ $term will have their magical regex significance unless you put "\Q" and "\E" around the variable when doing the match.

Update: I get it now -- you're building a concordancer, that will produce a listing of "key words in context" (KWIC). This is a great exercise for honing your perl skills (even though there are numerous open-source and free-ware packages available on the web to do this already -- do a google search for "KWIC"). One suggestion: don't limit your context to individual lines of text -- line breaks are an arbitrary disruption of linguistic content, and it's better to just ignore them. Here's one way, assuming that your .txt files really are just plain text (without any markup or other noise):

    my %input;
    $input{$_} = undef for @$inputs;  # make this a hash

    $/ = undef; # look up $/ in perldoc perlvar
    unless ( open( FILE, ...)) {
        #blah
        next;
    }
    $text = <FILE>;  # $text holds entire file content.
    $text =~ s/\n\n/ &lt;P&gt; /; # (optional: preserve paragraph boun
+daries)

    @words = split( /\s+/, $text );

    for my $i (0..$#words) {
        next unless ( exists( $input{$word[$i]} ));

        # get here when there's a match, output matched
        # word along with "N" words of surrounding context
        # (left as an exercise...)
    }
[download]

Comment on Re: Pattern Matching With Regular Expressions Select or Download Code

Replies are listed 'Best First'.
Re: Pattern Matching With Regular Expressions by Anonymous Monk on Apr 13, 2004 at 04:37 UTC
Why do they have to be referenced? They aren't changing are they? Now it looks like this: sub findtext { @filenumbers = @_; foreach $number(@filenumbers) { push @filenumbers2, split(/\W/, $number); } foreach $number(@filenumbers2) { chomp $number; if( defined $number ) { open(FILE,"/home/jroberts/$number.txt") or die "$!"; while(<FILE>) { for $term(@inputs) { next unless (/\b($term)\b/i); push @before, split(' ', $`); @before = reverse(@before); @before = splice(@before, 0, 7); @before = reverse(@before); push @after, split(' ', $'); @after = splice(@after, 0, 7); if(exists $results{$number}) { $existing = $results{$number}; $results{$number} = $existing . "... @before" . "<b>$1</b> +" . "@after ..."; } else { $results{$number} = "$url... @before" . "<b>$1</b>" . "@af +ter "; } @before = undef; @after = undef; next; } } close(FILE); print "Match found in $number.txt\n"; @fulltext = $results{$number}; print "@fulltext\n"; } else { next; } next; } } [download] This still returns only one match per $term. I still don't understand how to "rewind" the file. @before and @after just put text around the matches, kinda like a search engine format. jroberts	[reply] [d/l]
Re: Re: Pattern Matching With Regular Expressions by graff (Chancellor) on Apr 13, 2004 at 04:48 UTC
At this point (with the code as shown just now), you're only getting one match because every time you find a match, you completely reset the value of "$results{$number}" -- you would want something like this whenever a match is found: `$results{$number} .= "$url... @before <B>$1</B> @after <BR/>";` [download] Note the concatenation operator ".=" Sorry -- I wasn't paying close-enough attention. If you're saying that a given $term might occur more than once on a given line, and you're only getting the first occurrence, not both, yeah, that makes sense. You do a next after processing the first match of $term on each line. The logic I suggested in my update about doing KWIC searches will fix this. Otherwise, you have to do something like: `for $term ( @$inputs ) { while ( /\b$term\b/g ) { ... } }` [download] ~~BTW, please note the update I made in my earlier reply, about doing KWIC. I think other replies in this thread have explained about seeking to the beginning of the file, which is now a moot point.~~ no longer relevant. Another update, to answer your question about references: you're right, the input args are not being changed, but I'm suggesting that you pass two arrays to the sub: one is a list of files to search in, and the other is a list of terms to search for; using references to arrays allows you to pass both of these in one sub call -- if you don't use array refs, you're just passing an undifferentiated list, and the sub has no way of knowing where one array ends and the other begins. (whew! sorry about the mess!)	[reply] [d/l] [select]
Re: Pattern Matching With Regular Expressions by Anonymous Monk on Apr 13, 2004 at 05:01 UTC
I don't reset the hash, I put $existing into it. Using your concatenated code doesn't work either, only one result. And I don't understand your KWIC code either, I don't think it will work for what I'm doing. How do I use seek in this context. Thanks for the help so far, jroberts	[reply]
Re: Re: Pattern Matching With Regular Expressions by graff (Chancellor) on Apr 13, 2004 at 05:55 UTC
Re: Re: Pattern Matching With Regular Expressions by Anonymous Monk on Apr 28, 2004 at 03:00 UTC
Some notes below your chosen depth have not been shown here