in reply to Re: Pattern Matching With Regular Expressions
in thread Pattern Matching With Regular Expressions

Why do they have to be referenced? They aren't changing are they? Now it looks like this:
sub findtext {
    @filenumbers = @_;
    foreach $number (@filenumbers) {
        push @filenumbers2, split(/\W/, $number);
    }
    foreach $number (@filenumbers2) {
        chomp $number;
        if ( defined $number ) {
            open(FILE, "/home/jroberts/$number.txt") or die "$!";
            while (<FILE>) {
                for $term (@inputs) {
                    next unless (/\b($term)\b/i);
                    push @before, split(' ', $`);
                    @before = reverse(@before);
                    @before = splice(@before, 0, 7);
                    @before = reverse(@before);
                    push @after, split(' ', $');
                    @after = splice(@after, 0, 7);
                    if (exists $results{$number}) {
                        $existing = $results{$number};
                        $results{$number} = $existing . "... @before" . "<b>$1</b>" . "@after ...";
                    }
                    else {
                        $results{$number} = "$url... @before" . "<b>$1</b>" . "@after ";
                    }
                    @before = undef;
                    @after = undef;
                    next;
                }
            }
            close(FILE);
            print "Match found in $number.txt\n";
            @fulltext = $results{$number};
            print "@fulltext\n";
        }
        else {
            next;
        }
        next;
    }
}
This still returns only one match per $term. I still don't understand how to "rewind" the file. @before and @after just put text around the matches, kinda like a search engine format.
jroberts

Re: Re: Pattern Matching With Regular Expressions
by graff (Chancellor) on Apr 13, 2004 at 04:48 UTC
    At this point (with the code as shown just now), you're only getting one match because every time you find a match, you completely reset the value of "$results{$number}" -- you would want something like this whenever a match is found:
    $results{$number} .= "$url... @before <B>$1</B> @after <BR/>";
    Note the concatenation operator ".="
    Sorry -- I wasn't paying close-enough attention. If you're saying that a given $term might occur more than once on a given line, and you're only getting the first occurrence, not both, yeah, that makes sense. You do a next after processing the first match of $term on each line. The logic I suggested in my update about doing KWIC searches will fix this. Otherwise, you have to do something like:
    for $term ( @$inputs ) { while ( /\b$term\b/g ) { ... } }
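
To make the while ( //g ) idea concrete, here is a minimal sketch. The sub name all_matches and the sample line are made up for illustration and are not part of the code above. In scalar context a /g match resumes where the previous match ended, so the loop visits every occurrence on the line instead of stopping at the first. The \Q...\E quoting assumes the terms are literal words rather than patterns.

```perl
use strict;
use warnings;

# Hypothetical helper: return every occurrence of each term in one line,
# as [ text_as_matched, start_offset ] pairs.
sub all_matches {
    my ($line, @terms) = @_;
    my @hits;
    for my $term (@terms) {
        # In scalar context, /g continues from the end of the previous match,
        # so this loop runs once per occurrence of $term in $line.
        while ( $line =~ /\b(\Q$term\E)\b/gi ) {
            push @hits, [ $1, $-[1] ];    # $-[1] is where group 1 starts
        }
    }
    return @hits;
}

my @hits = all_matches( "the cat saw another Cat", "cat" );
print "$_->[0] at offset $_->[1]\n" for @hits;
```

Inside the loop body, $`, $' and $1 refer to the current occurrence, so the before/after context can be collected once per match.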

    BTW, please note the update I made in my earlier reply, about doing KWIC. I think other replies in this thread have explained about seeking to the beginning of the file, which is now no longer relevant.

    Another update, to answer your question about references: you're right, the input args are not being changed, but I'm suggesting that you pass two arrays to the sub: one is a list of files to search in, and the other is a list of terms to search for; using references to arrays allows you to pass both of these in one sub call -- if you don't use array refs, you're just passing an undifferentiated list, and the sub has no way of knowing where one array ends and the other begins. (whew! sorry about the mess!)
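
A small sketch of the difference, assuming nothing beyond core Perl. The sub names count_args and count_flat and the sample data are invented for this illustration. With references the sub receives exactly two scalars; without them it receives a single flattened list.

```perl
use strict;
use warnings;

# Hypothetical sub: takes two array references and reports each array's size.
sub count_args {
    my ($files, $terms) = @_;    # two scalars, each a reference to an array
    return ( scalar @$files, scalar @$terms );
}

# Hypothetical sub: just reports how many arguments arrived.
sub count_flat { return scalar @_ }

my @files = ( "1 2 3", "4" );
my @terms = ( "foo", "bar", "baz" );

# Passing \@files and \@terms keeps the two lists separate.
my ($nfiles, $nterms) = count_args( \@files, \@terms );
print "files: $nfiles, terms: $nterms\n";

# Passing the arrays directly merges them into one list of five elements,
# with no marker for where @files ends and @terms begins.
print "flattened: ", count_flat( @files, @terms ), "\n";
```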

      I don't reset the hash, I put $existing into it. Using your concatenated code doesn't work either, only one result. And I don't understand your KWIC code either, I don't think it will work for what I'm doing. How do I use seek in this context?
      Thanks for the help so far,
      jroberts
        Okay, I guess the KWIC thing, the way I originally tried to explain it, is a little off-track for you. Still, if your goal is something like:
        • Highlight all words that match the set of target terms.
        • Print all occurrences of matching words along with some preceding and following words
        then you should consider storing all the words of the file (in order of occurrence) into a single array, adding highlights to the array elements that happen to match the search terms, and when that's done, go through the array to print out the regions that contain one or more highlighted terms, so the result looks like:
        ... this is a sequence that has target1 as well as target2, where target2 occurs twice in a short span ...

        Try the sub this way (not tested):

        sub findtext {
            my ($files, $terms) = @_;
            my @filenames;
            for my $arg ( @$files ) {
                push @filenames, grep /\w/, split( /\W+/, $arg );
            }
            my %target;
            $target{$_} = undef for @$terms;
            local $/ = undef;  # this only applies within the sub
            for my $file ( @filenames ) {
                unless ( open( FILE, "/home/jroberts/$file.txt" )) {
                    warn "open failed on $file: $!";
                    next;
                }
                $_ = <FILE>;  # read full text
                close FILE;
                my @words = split;  # @words has all words in $file
                for ( @words ) {
                    s{(.*)}{<B>$1</B>} if ( exists( $target{$_} ));
                }
                # all target words in $file are now marked, so
                # print the sequences that contain marked words
                my $printing = 0;
                for my $i ( 0 .. $#words ) {
                    if ( $words[$i] =~ /<B>/ ) {
                        if ( $i and $printing == 0 ) {  # backtrack for prior context
                            my $j = ( $i >= 6 ) ? $i - 6 : 0;
                            print join( " ", @words[$j..$i-1] ), " ";
                        }
                        print "$words[$i] ";  # (update: have to print this every time)
                        $printing = 6;  # number of following words to print
                    }
                    elsif ( $printing ) {
                        print "$words[$i] ";  # trailing space keeps words separated
                        $printing--;
                        print "\n<br/>\n" if ( $printing == 0 );
                    }
                }
            }
        }
        (updated to always do the right thing when printing out the target strings)
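
One caveat: the hash lookup in the marking step is an exact string comparison, so it is case-sensitive and treats "target1," as a different word from "target1", whereas the original regex used /i and \b. If that matters, one possible variation of the marking step is sketched below. The helper name mark_words and the sample text are hypothetical and the approach is only a suggestion, not part of the sub above.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap words that match a term in <B>...</B>,
# ignoring case and any punctuation attached to the word.
sub mark_words {
    my ($text, @terms) = @_;
    my %target = map { lc($_) => 1 } @terms;    # lower-cased lookup table
    my @words = split ' ', $text;
    for my $word (@words) {
        # Compare a lower-cased copy with leading/trailing punctuation removed.
        ( my $bare = lc $word ) =~ s/^\W+|\W+$//g;
        $word = "<B>$word</B>" if exists $target{$bare};
    }
    return @words;
}

print join( " ", mark_words( "Foo, bar and foo.", "foo" ) ), "\n";
```

The returned list could then be fed to the same printing loop, since that loop only checks each element for "<B>".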