
Okay, I guess the KWIC thing, the way I originally tried to explain it, is a little off-track for you. Still, if your goal is something like:

Highlight all words that match the set of target terms.
Print all occurrences of matching words along with some preceding and following words

then you should consider storing all the words of the file (in order of occurrence) in a single array and adding highlights to the array elements that match the search terms. When that's done, go through the array and print out the regions that contain one or more highlighted terms, so the result looks like:

... this is a sequence that has target1 as well as target2, where target2 occurs twice in a short span ...

Try the sub this way (not tested):

sub findtext
{
    my ($files, $terms) = @_;
    my @filenames;
    for my $arg ( @$files ) {
        push @filenames, grep /\w/, split( /\W+/, $arg );
    }
    my %target;
    $target{$_} = undef for @$terms;

    local $/ = undef;  # this only applies within the sub

    for my $file ( @filenames )
    {
        my $fh;
        unless ( open( $fh, '<', "/home/jroberts/$file.txt" )) {
            warn "open failed on $file: $!";
            next;
        }
        $_ = <$fh>;  # read full text (slurped, since $/ is undef)
        close $fh;

        my @words = split;  # @words has all words in $file
        for ( @words ) {
            s{(.*)}{<B>$1</B>} if ( exists( $target{$_} ));
        }

        # all target words in $file are now marked, so
        # print the sequences that contain marked words

        my $printing = 0;
        for my $i ( 0 .. $#words )
        {
            if ( $words[$i] =~ /<B>/ ) {
                if ( $i and $printing == 0 ) {   # backtrack for prior context
                    my $j = ( $i >= 6 ) ? $i - 6 : 0;
                    print join( " ", @words[$j..$i-1] ), " ";
                }
                print "$words[$i] "; # (update: have to print this every time)
                $printing = 6;  # number of following words to print
            }
            elsif ( $printing ) {
                print "$words[$i] ";
                $printing--;
                print "\n<br/>\n" if ( $printing == 0 );
            }
        }
        # finish the last line if the file ended inside a printed region
        print "\n<br/>\n" if ( $printing );
    }
}

(updated to always do the right thing when printing out the target strings)
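
If you want to try the mark-then-print idea without touching any files, here is a minimal sketch that works on a string in memory and returns the lines instead of printing them. The name kwic_lines and its $context parameter are made up for this illustration (they are not part of the sub above), and like the sub it only matches whole whitespace-separated words exactly, so "target1," with a trailing comma would not count as a hit.

```perl
use strict;
use warnings;

# kwic_lines() is a hypothetical helper for illustration only.
# Arguments: the text, an array ref of target terms, and the number
# of context words to keep on each side of a match.
# Returns a list of strings, one per printed region.
sub kwic_lines {
    my ( $text, $terms, $context ) = @_;
    my %target = map { $_ => 1 } @$terms;

    my @words = split ' ', $text;
    my @lines;
    my @current;          # words collected for the region being built
    my $remaining = 0;    # following-context words still to collect

    for my $i ( 0 .. $#words ) {
        if ( $target{ $words[$i] } ) {
            if ( $remaining == 0 ) {
                # starting a new region: back up for preceding context
                my $j = $i >= $context ? $i - $context : 0;
                push @current, @words[ $j .. $i - 1 ];
            }
            push @current, "<B>$words[$i]</B>";
            $remaining = $context;    # restart the count after every hit
        }
        elsif ( $remaining ) {
            push @current, $words[$i];
            if ( --$remaining == 0 ) {
                push @lines, join ' ', @current;
                @current = ();
            }
        }
    }
    # flush a region that was still open when the text ran out
    push @lines, join ' ', @current if @current;
    return @lines;
}

my $text = "this is a sequence that has target1 as well as target2 in it";
print "$_\n<br/>\n" for kwic_lines( $text, [qw(target1 target2)], 2 );
```

Returning a list rather than printing directly makes it easier to check the logic on small strings before pointing it at real files. Note that when two hits are far enough apart to land in separate regions, the context words between them can show up in both.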

In reply to Re: Re: Pattern Matching With Regular Expressions by graff
in thread Pattern Matching With Regular Expressions by Anonymous Monk
