Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am looking to find out the fastest possible way to search for a list of strings in a file (size 21 GB). Can you give any ideas, examples, or methodology?

Replies are listed 'Best First'.
Re: Fastest Search method for strings in large file
by salva (Canon) on Jul 14, 2008 at 10:26 UTC
    Why don't you take your time and explain to us in detail what you want to achieve: how many strings you want to match, what kind of data is in the file, etc.

    Otherwise that will be like playing piñata!

      The basic requirement is to extract the records containing a search string from the file and write them to another file. Each search string will be a regular string of 10 characters. The number of search strings will be around 100-1000. The file is a delimited file with a large number of records (size 21 GB).

        Re: Fastest Search method for strings in large file modified to print whole "\n" delimited records to stdout:

        #! perl -slw
        use strict;
        use List::Util qw[ max ];

        our $BUFSIZE ||= 2**16;

        my @needles = qw[ 12345 67890 ];
        my $regex   = '(?:' . join( '|', map quotemeta, @needles ) . ')';
        my $maxLen  = max map length, @needles;

        open FILE, '<', $ARGV[ 0 ] or die "$ARGV[ 0 ]: $!";

        my( $toRead, $soFar, $offset ) = ( $BUFSIZE, 0, 0 );

        while( my $read = sysread FILE, $_, $toRead, $offset ) {
            ## print every newline-delimited record that contains a needle
            if( m[$regex] ) {
                while( m[^([^\n]*$regex[^\n]*$)]mg ) {
                    print $1;
                }
            }
            $soFar += $read;

            ## carry the trailing partial record to the front of the buffer
            ## so the next read completes it
            my $len = length() - rindex $_, "\n";
            substr $_, 0, $len, substr $_, -$len;
            $offset = $len;
            $toRead = $BUFSIZE - $len;
        }

        On my system, performance tails off sharply with BUFSIZEs above 2**16.
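
        A possible invocation, assuming the code above is saved as search_records.pl (the file names here are mine, not part of the original): the needles are hard-coded in @needles, the file to search is the first argument, and the -s switch on the shebang line lets you override $BUFSIZE from the command line while experimenting.

        perl search_records.pl bigfile.txt > matches.txt
        perl search_records.pl -BUFSIZE=131072 bigfile.txt > matches.txt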


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Fastest Search method for strings in large file
by moritz (Cardinal) on Jul 14, 2008 at 10:35 UTC
    With perl 5.10.0, regexes are very fast at searching for many constant alternatives. See Re^4: Efficient regex matching with qr//; Can I do better? (Benchmark) for a benchmark comparing perl 5.8.8 and 5.10.0.

    The details depend on what your search strings look like. If they contain newlines, you can't just read the file line by line. If you read block by block, you have to check for matches at block boundaries. So please provide more information on both the search strings and the file that is being searched.
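
    For the simplest case, where the search strings contain no newlines and the records are lines, a minimal sketch might look like this (the file names strings.txt and big.file are assumptions, not something specified in the question):

    #!/usr/bin/perl
    use strict;
    use warnings;

    ## read the search strings, one per line (strings.txt is an assumed name)
    open my $sf, '<', 'strings.txt' or die "strings.txt: $!";
    chomp( my @needles = <$sf> );
    close $sf;

    ## one alternation of literal strings; perl 5.10 can build a trie from this
    my $alt = join '|', map quotemeta, @needles;
    my $re  = qr/(?:$alt)/;

    open my $in, '<', 'big.file' or die "big.file: $!";
    while ( my $line = <$in> ) {
        print $line if $line =~ $re;    # emit every record that contains a needle
    }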

Re: Fastest Search method for strings in large file
by BrowserUk (Patriarch) on Jul 14, 2008 at 12:32 UTC

    You could do worse than use a sliding buffer something like this:

    #! perl -slw
    use strict;
    use List::Util qw[ max ];

    our $BUFSIZE ||= 2**20;

    my @needles = qw[ 2228809700 123456 234567 345678 456789 1234567890 ];
    my $regex   = '(?:' . join( '|', map quotemeta, @needles ) . ')';
    my $maxLen  = max map length, @needles;

    open FILE, '<', $ARGV[ 0 ] or die "$ARGV[ 0 ]: $!";

    my( $soFar, $offset ) = ( 0, 0 );

    while( my $read = sysread FILE, $_, $BUFSIZE, $offset ) {
        ## report the file offset and the text of every match in this buffer
        while( m[$regex]g ) {
            printf "(%d): '%s'\n", pos() + $soFar, substr $_, $-[0], $+[0] - $-[0];
        }

        ## slide the buffer: keep the last $maxLen characters so a match
        ## straddling two reads is still found
        substr $_, 0, $maxLen, substr $_, -$maxLen;
        $soFar += $read;
        $offset = $maxLen;
    }

    The output is: (28749820): '345678', that is, the byte offset in the file followed by the string matched.

    The basic principles are:

    1. Use a largish read size to minimise the number of times you hit the disk and start the regex engine.

      Finding the optimum BUFSIZE for your system takes a little experimentation. Larger is not always faster.

    2. Perform the sliding buffer manipulations and the read 'in-place', overlaying the same buffer, to minimise the work done by the GC.

      The manipulations with $maxLen are there to ensure that a potential match crossing the boundary between two reads will still be matched. Basically, they retain as many characters from the preceding read as are required to match the longest needle, and append the new read after them.

      That math could be enhanced to reduce the next read size by the length of the retained residual.

    3. Build an alternation regex.

      This will work better under 5.10, but be aware that there are limits. From memory, more than a few thousand search strings will cause 5.10 to abandon the trie optimisation.
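
      If you do run into that limit, one possible workaround (just a sketch; the batch size of 1000 is a guess, not a measured threshold) is to split the needles into several smaller alternations and test each one in turn:

      use strict;
      use warnings;

      ## assumed: the real needle list would be read from somewhere else
      my @needles = qw[ 1234567890 2345678901 3456789012 ];

      ## compile one alternation per batch so each regex stays small enough
      ## to benefit from the trie optimisation
      my $batch = 1_000;
      my @regexes;
      while( my @chunk = splice @needles, 0, $batch ) {
          my $alt = join '|', map quotemeta, @chunk;
          push @regexes, qr/(?:$alt)/;
      }

      ## a record matches if any of the batched alternations matches
      sub matches {
          my $text = shift;
          for my $re ( @regexes ) {
              return 1 if $text =~ $re;
          }
          return 0;
      }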


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Fastest Search method for strings in large file
by massa (Hermit) on Jul 14, 2008 at 14:10 UTC
    Mmap to the rescue!
    use strict;
    use warnings;
    use Mmap;

    open my $file, '<', 'big21gfilename.ext' or die;

    ## map the whole file into the scalar $f instead of reading it into memory
    mmap my $f, 0, PROT_READ, MAP_SHARED, $file or die;

    ## build one alternation from the search strings given on the command line
    my $re = '(' . join( '|', @ARGV ) . ')';
    $re = qr($re);

    printf "%d: %s\n", pos($f), $1 while $f =~ /$re/g;

    unless, of course, your file has lines, in which case

    use strict;
    use warnings;

    ## build one alternation from the search strings given on the command line
    my $re = '(' . join( '|', @ARGV ) . ')';
    $re = qr($re);

    open my $file, '<', 'big21gfilename.ext' or die;

    ## print line number, match position and matched string for every hit
    while( <$file> ) {
        printf "%d (%d): %s\n", $., pos, $1 while /$re/g;
    }
    will probably be faster.
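
    With either version the search strings come from the command line (the pattern is built from @ARGV) and the file name is hard-coded, so an invocation would look something like this (the script name is my own placeholder):

    perl mmap_search.pl string1 string2 string3 > matches.txt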
    []s, HTH, Massa