Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am looking to find out the fastest possible way to search for a list of strings in a file (size 21 GB). Can you give any ideas, examples, or methodology?

Replies are listed 'Best First'.
Re: Fastest Search method for strings in large file
by salva (Canon) on Jul 14, 2008 at 10:26 UTC
    Why don't you take your time and explain to us in detail what you want to achieve: how many strings you want to match, what kind of data is in the file, etc.

    Otherwise that will be like playing piñata!

      The basic requirement is to extract the records containing a search string from the file and write them to another file. Each search string will be a regular string of 10 characters. The number of search strings will be around 100-1000. The file is a delimited file with a large number of records (size 21 GB).

        Re: Fastest Search method for strings in large file modified to print whole "\n" delimited records to stdout:

        #! perl -slw
        use strict;
        use List::Util qw[ max ];

        our $BUFSIZE ||= 2**16;

        my @needles = qw[ 12345 67890 ];
        my $regex   = '(?:' . join( '|', map quotemeta, @needles ) . ')';
        my $maxLen  = max map length, @needles;

        open FILE, '<', $ARGV[ 0 ] or die "$ARGV[ 0 ]: $!";

        my( $toRead, $soFar, $offset ) = ( $BUFSIZE, 0, 0 );

        while( my $read = sysread FILE, $_, $toRead, $offset ) {
            ## print every newline-delimited record that contains a needle
            if( m[$regex] ) {
                while( m[^([^\n]*$regex[^\n]*$)]mg ) {
                    print $1;
                }
            }
            $soFar += $read;

            ## carry the trailing partial record to the front of the buffer
            ## so the next read completes it
            my $len = length() - rindex $_, "\n";
            substr $_, 0, $len, substr $_, -$len;
            $offset = $len;
            $toRead = $BUFSIZE - $len;
        }

        On my system, performance tails off sharply with BUFSIZEs above 2**16.
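
        A possible invocation, assuming the code above is saved as search_records.pl (the file names here are mine, not part of the original): the needles are hard-coded in @needles, the file to search is the first argument, and the -s switch on the shebang line lets you override $BUFSIZE from the command line while experimenting.

        perl search_records.pl bigfile.txt > matches.txt
        perl search_records.pl -BUFSIZE=131072 bigfile.txt > matches.txt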


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Fastest Search method for strings in large file
by moritz (Cardinal) on Jul 14, 2008 at 10:35 UTC
    With perl 5.10.0, regexes are very fast at searching for many constant alternatives. See Re^4: Efficient regex matching with qr//; Can I do better? (Benchmark) for a benchmark comparing perl 5.8.8 and 5.10.0.

    The details depend on what your search strings look like. If they contain newlines, you can't just read the file line by line. If you read block by block, you have to check for matches at block boundaries. So please provide more information on both the search strings and the file that is being searched.
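
    For the simplest case, where the search strings contain no newlines and the records are lines, a minimal sketch might look like this (the file names strings.txt and big.file are assumptions, not something specified in the question):

    #!/usr/bin/perl
    use strict;
    use warnings;

    ## read the search strings, one per line (strings.txt is an assumed name)
    open my $sf, '<', 'strings.txt' or die "strings.txt: $!";
    chomp( my @needles = <$sf> );
    close $sf;

    ## one alternation of literal strings; perl 5.10 can build a trie from this
    my $alt = join '|', map quotemeta, @needles;
    my $re  = qr/(?:$alt)/;

    open my $in, '<', 'big.file' or die "big.file: $!";
    while ( my $line = <$in> ) {
        print $line if $line =~ $re;    # emit every record that contains a needle
    }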

Re: Fastest Search method for strings in large file
by BrowserUk (Patriarch) on Jul 14, 2008 at 12:32 UTC

    You could do worse than use a sliding buffer something like this:

    #! perl -slw
    use strict;
    use List::Util qw[ max ];

    our $BUFSIZE ||= 2**20;

    my @needles = qw[ 2228809700 123456 234567 345678 456789 1234567890 ];
    my $regex   = '(?:' . join( '|', map quotemeta, @needles ) . ')';
    my $maxLen  = max map length, @needles;

    open FILE, '<', $ARGV[ 0 ] or die "$ARGV[ 0 ]: $!";

    my( $soFar, $offset ) = ( 0, 0 );

    while( my $read = sysread FILE, $_, $BUFSIZE, $offset ) {
        ## report the file offset and the text of every match in this buffer
        while( m[$regex]g ) {
            printf "(%d): '%s'\n", pos() + $soFar, substr $_, $-[0], $+[0] - $-[0];
        }

        ## slide the buffer: keep the last $maxLen characters so a match
        ## straddling two reads is still found
        substr $_, 0, $maxLen, substr $_, -$maxLen;
        $soFar += $read;
        $offset = $maxLen;
    }

    The output is: (28749820): '345678', that is, the byte offset in the file followed by the string matched.

    The basic principles are:

    1. Use a largish read size to minimise the number of times you hit the disk and start the regex engine.

      Finding the optimum BUFSIZE for your system takes a little experimentation. Larger is not always faster.

    2. Perform the sliding buffer manipulations and the read 'in-place', overlaying the same buffer, to minimise the work done by the GC.

      The manipulations with $maxLen are there to ensure that a potential match crossing the boundary between two reads will still be matched. Basically, they retain as many characters from the preceding read as are required to match the longest needle, and append the new read after them.

      That math could be enhanced to reduce the next read size by the length of the retained residual.

    3. Build an alternation regex.

      This will work better under 5.10, but be aware that there are limits. From memory, more than a few thousand search strings will cause 5.10 to abandon the trie optimisation.
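
      If you do run into that limit, one possible workaround (just a sketch; the batch size of 1000 is a guess, not a measured threshold) is to split the needles into several smaller alternations and test each one in turn:

      use strict;
      use warnings;

      ## assumed: the real needle list would be read from somewhere else
      my @needles = qw[ 1234567890 2345678901 3456789012 ];

      ## compile one alternation per batch so each regex stays small enough
      ## to benefit from the trie optimisation
      my $batch = 1_000;
      my @regexes;
      while( my @chunk = splice @needles, 0, $batch ) {
          my $alt = join '|', map quotemeta, @chunk;
          push @regexes, qr/(?:$alt)/;
      }

      ## a record matches if any of the batched alternations matches
      sub matches {
          my $text = shift;
          for my $re ( @regexes ) {
              return 1 if $text =~ $re;
          }
          return 0;
      }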


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Fastest Search method for strings in large file
by massa (Hermit) on Jul 14, 2008 at 14:10 UTC
    Mmap to the rescue!
    use strict;
    use warnings;
    use Mmap;

    open my $file, '<', 'big21gfilename.ext' or die;

    ## map the whole file into the scalar $f instead of reading it into memory
    mmap my $f, 0, PROT_READ, MAP_SHARED, $file or die;

    ## build one alternation from the search strings given on the command line
    my $re = '(' . join( '|', @ARGV ) . ')';
    $re = qr($re);

    printf "%d: %s\n", pos($f), $1 while $f =~ /$re/g;

    unless, of course, your file has lines, in which case

    use strict;
    use warnings;

    ## build one alternation from the search strings given on the command line
    my $re = '(' . join( '|', @ARGV ) . ')';
    $re = qr($re);

    open my $file, '<', 'big21gfilename.ext' or die;

    ## print line number, match position and matched string for every hit
    while( <$file> ) {
        printf "%d (%d): %s\n", $., pos, $1 while /$re/g;
    }
    will probably be faster.
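
    With either version the search strings come from the command line (the pattern is built from @ARGV) and the file name is hard-coded, so an invocation would look something like this (the script name is my own placeholder):

    perl mmap_search.pl string1 string2 string3 > matches.txt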
    []s, HTH, Massa