#! perl -slw
use strict;
use List::Util qw[ max ];

# -s on the shebang line allows the default buffer size to be
# overridden with -BUFSIZE=nnn on the command line.
our $BUFSIZE ||= 2**20;

my @needles = qw[
    2228809700
    123456
    234567
    345678
    456789
    1234567890
];

# Alternation of the literal needles; quotemeta guards against metachars.
my $regex = '(?:' . join( '|', map quotemeta, @needles ) . ')';

# The longest needle fixes how much of each buffer must be retained
# to catch matches that straddle a read boundary.
my $maxLen = max map length, @needles;

open my $fh, '<', $ARGV[ 0 ] or die "$ARGV[ 0 ]: $!";
binmode $fh;    # byte-accurate offsets on all platforms

my( $soFar, $offset ) = ( 0, 0 );
while( my $read = sysread $fh, $_, $BUFSIZE, $offset ) {
    while( m[$regex]g ) {
        # Matches wholly inside the retained tail were already
        # reported while scanning the previous buffer.
        next if $+[0] <= $offset;
        # $soFar - $offset is the file offset of the buffer's first byte.
        printf "(%d): '%s'\n",
            $soFar - $offset + $-[0], substr $_, $-[0], $+[0] - $-[0];
    }
    # Slide: copy the last $maxLen bytes over the front of the buffer
    # in place; the next sysread overlays everything after them.
    substr $_, 0, $maxLen, substr $_, -$maxLen;
    $soFar += $read;
    $offset = $maxLen;
}
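Thanks to the -s on the shebang line, the default buffer size can be overridden from the command line. A hypothetical invocation (the script and file names here are placeholders, not from the original):

perl findneedles.pl hugefile.dat                    # default 1MB reads
perl findneedles.pl -BUFSIZE=4194304 hugefile.dat   # experiment with 4MB reads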
The output is:

(28749820): '345678'

That is the byte offset of the match in the file, followed by the string matched.
The basic principles are:
- use a largish read size, to minimise the number of times you hit the disk and start the regex engine.
Finding the optimum BUFSIZE for your system takes a little experimentation; the -BUFSIZE command-line override shown above makes that easy. Larger is not always faster.
- perform the sliding buffer manipulation and the read 'in-place', overlaying the same buffer, to minimise the allocation and copying work perl's memory management has to do.
The manipulations with $maxLen are there to ensure that a potential match that crosses the boundary between two reads will still be found. Basically, it retains as many characters as are required to match the longest needle from the preceding read, and appends the new read after them. (The first sketch after this list demonstrates this on two hard-coded chunks.)
That arithmetic could be enhanced to reduce the read size by the length of the retained residual, keeping the buffer at a constant length.
- build an alternation regex.
This will work better under 5.10, which can compile such alternations into a trie, but be aware that there are limits. From memory, more than a few thousand search strings will cause 5.10 to abandon the trie optimisation. (The second sketch below shows one way to check.)
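Here is a stripped-down sketch of the sliding buffer (mine, for illustration; the names and the two hard-coded "reads" are invented). It applies the same in-place 4-arg substr slide and overlap bookkeeping to a needle that straddles the chunk boundary:

use strict;
use warnings;

my $needle = '123456';
my $maxLen = length $needle;
my $data   = 'xxxxxxxx123456yyyyyyyy';              # needle starts at byte 8
my @chunks = ( substr( $data, 0, 11 ), substr( $data, 11 ) );

my( $buf, $soFar, $offset ) = ( '', 0, 0 );
for my $chunk ( @chunks ) {                         # stands in for sysread
    substr $buf, $offset, length( $buf ), $chunk;   # overlay the new "read"
    while( $buf =~ m[\Q$needle\E]g ) {
        next if $+[0] <= $offset;                   # seen in previous chunk
        printf "(%d): '%s'\n", $soFar - $offset + $-[0], $needle;
    }
    substr $buf, 0, $maxLen, substr $buf, -$maxLen; # slide the tail down
    $soFar += length $chunk;
    $offset = $maxLen;
}

Despite the needle being split 'xxxxxxxx123' / '456yyyyyyyy' across the two chunks, it prints (8): '123456'.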
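And one quick way to see whether a given perl actually built the trie (my suggestion, not from the original): compile the alternation under re 'debug', which dumps the compiled program to STDERR; alternations that got the optimisation show TRIE nodes in the listing. If memory serves, the size cutoff is tied to the trie memory cap exposed as ${^RE_TRIE_MAXBUF}.

use strict;
use warnings;
use re 'debug';     # dump regex compilation (and matching) to STDERR

my @needles = qw[ 123456 234567 345678 456789 ];
my $regex   = '(?:' . join( '|', map quotemeta, @needles ) . ')';
qr/$regex/;         # under 5.10+, look for TRIE-EXACT in the dump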
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.