in reply to searching for a keyword with context window

Reading the file in large chunks is more efficient than going line by line, but then you have to worry about a chunk boundary splitting the target word into more than one piece. That means grabbing the last partial word of each chunk and prepending it to the next one, which costs some of the efficiency back. And when you do find the word, you have to seek back to a point some distance before it, read in the word plus a buffer zone around it, and extract the other 10 words, so the script also slows down if there are numerous matches. I got about 106 seconds using the following on a 712 MB file containing the target word at the very end:
use strict;
use warnings;

my $fname = 'test.dat';
my $csize = 102400;                 # chunk size in bytes
my $word  = 'bingo';

my $end = '';                       # trailing word fragment from the previous chunk

open(my $handle, '<', $fname) or die "Can't open $fname: $!";
while (1) {
    my $length = read($handle, $_, $csize);
    last unless $length;
    # Search the previous fragment plus the new chunk, so a word split
    # across the chunk boundary is still found.
    my $hit = index($end . $_, $word);
    if ($hit != -1) {
        # Seek back to roughly 100 bytes before the match and re-read a
        # window big enough to hold the word plus its surrounding context.
        my $pos = tell($handle) - $length - length($end) + $hit - 100;
        $pos = 0 if $pos < 0;
        seek($handle, $pos, 0);
        read($handle, $_, 220);
        my @words = split /\W+/;
        for my $c (0 .. $#words) {
            next unless $words[$c] eq $word;
            # Print the match plus up to five words on either side.
            my $lo = $c > 5 ? $c - 5 : 0;
            my $hi = $c + 5 < $#words ? $c + 5 : $#words;
            print join(' ', @words[$lo .. $hi]), "\n";
            last;
        }
        last;
    }
    # Remember the partial word at the end of this chunk for the next pass.
    ($end) = /(\w*)$/;
}
close($handle);
Decidedly inefficient, and I haven't provided for multiple matches yet either. It would probably be much better to index the file for all words occurring fewer than x times, and/or to use a system utility to find the locations of the matches. Even 30 seconds is unacceptable; under 5 would be a lot better.
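For the system-utility route, something along these lines might work. It's only a rough sketch, assuming GNU grep is installed; the file name, target word, and window sizes are the same placeholders as above, and each grep hit is re-read and split the same way as in the script above.

use strict;
use warnings;

my $fname = 'test.dat';
my $word  = 'bingo';

open(my $fh, '<', $fname) or die "Can't open $fname: $!";
# GNU grep: -b prints the byte offset of each match, -o prints only the match,
# so each output line looks like "offset:word".
for my $line (qx(grep -ob $word $fname)) {
    my ($offset) = $line =~ /^(\d+):/ or next;
    # Seek to roughly 100 bytes before the match and read a context window.
    my $start = $offset > 100 ? $offset - 100 : 0;
    seek($fh, $start, 0);
    read($fh, my $buf, 220);
    my @words = split /\W+/, $buf;
    for my $i (0 .. $#words) {
        next unless $words[$i] eq $word;
        my $lo = $i > 5 ? $i - 5 : 0;
        my $hi = $i + 5 < $#words ? $i + 5 : $#words;
        print join(' ', @words[$lo .. $hi]), "\n";
        last;
    }
}
close($fh);

That hands the scanning to grep, which should be much faster than chunking through the file in Perl, and it handles multiple matches for free; whether it gets under the 5-second mark would still have to be measured.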