in reply to Fastest Search method for strings in large file
You could do worse than use a sliding buffer, something like this:
#! perl -slw
use strict;

use List::Util qw[ max ];

# Buffer size for each sysread; thanks to the -s switch it can be
# overridden from the command line (e.g. -BUFSIZE=65536).
our $BUFSIZE ||= 2**20;

my @needles = qw[ 2228809700 123456 234567 345678 456789 1234567890 ];

# Build a single alternation from the quoted needles, and note the
# length of the longest one.
my $regex  = '(?:' . join( '|', map quotemeta, @needles ) . ')';
my $maxLen = max map length, @needles;

open FILE, '<', $ARGV[ 0 ] or die "$ARGV[ 0 ]: $!";

my( $soFar, $offset ) = ( 0, 0 );

while( my $read = sysread FILE, $_, $BUFSIZE, $offset ) {
    # Report every match in the current buffer: position and matched text.
    while( m[$regex]g ) {
        printf "(%d): '%s'\n", pos() + $soFar, substr $_, $-[0], $+[0] - $-[0];
    }

    # Retain the last $maxLen characters at the front of the buffer so a
    # needle straddling two reads can still match; the next sysread
    # appends fresh data at OFFSET = $maxLen.
    substr $_, 0, $maxLen, substr $_, -$maxLen;

    $soFar  += $read;
    $offset  = $maxLen;
}
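Assuming the script above is saved as search.pl (a name chosen here, not from the original post), you run it with the file to scan as the only argument:

perl search.pl bigfile.dat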
The output is:

(28749820): '345678'

That is the byte offset in the file, followed by the string matched.
The basic principles are:
Finding the optimum BUFSIZE for your system takes a little experimentation. Larger is not always faster.
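Because the shebang line carries the -s switch, you can try different buffer sizes per run without editing the script (again assuming the file name search.pl):

perl search.pl -BUFSIZE=65536 bigfile.dat
perl search.pl -BUFSIZE=4194304 bigfile.dat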
The manipulations with $maxLen are there to ensure that a potential match crossing the boundary between two reads will still be found. Basically, the script retains as many characters from the end of the preceding read as are required to match the longest needle, and appends the new read after them (a minimal sketch of this carry-over step follows below).
That math could be enhanced to reduce the read size by the length of the residual retained.
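Here is a small, standalone sketch of the carry-over idea using toy data (not taken from the original script): a needle split across two simulated reads is only found once the tail of the first read is kept in front of the second.

#! perl -lw
use strict;

my $needle = '345678';
my $maxLen = length $needle;

# Two simulated reads; the needle straddles the boundary between them.
my @reads = ( 'xxxxxxxx34', '5678yyyyyy' );

my $buf = $reads[0];
print 'first read only : ', $buf =~ /$needle/ ? 'found' : 'not found';

# Keep the last $maxLen characters of the previous read, then append the
# next read, mimicking  substr $_, 0, $maxLen, substr $_, -$maxLen
# followed by sysread with OFFSET = $maxLen.
$buf = substr( $buf, -$maxLen ) . $reads[1];
print 'with carry-over : ', $buf =~ /$needle/ ? 'found' : 'not found';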
This will work better under Perl 5.10, which compiles an alternation of literal strings into a trie, but be aware that there are limits. From memory, more than a few thousand search strings will cause 5.10 to abandon the trie optimisation.
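One way to check whether your perl actually gives the alternation the trie treatment (a general debugging aid, not something from the original post) is to dump the compiled pattern with the re pragma and look for TRIE nodes in the output:

#! perl -w
use strict;
use re 'debug';    # prints the regex compilation dump to STDERR

my @needles = qw[ 123456 234567 345678 456789 ];
my $regex   = '(?:' . join( '|', map quotemeta, @needles ) . ')';

'no match here' =~ /$regex/;    # compiling the pattern emits the dump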