You could do worse than use a sliding buffer something like this:

#! perl -slw
use strict;
use List::Util qw[ max ];

our $BUFSIZE ||= 2**20;

my @needles = qw[
    2228809700
    123456
    234567
    345678
    456789
    1234567890
];

my $regex = '(?:' . join( '|', map quotemeta, @needles ) . ')';
my $maxLen = max map length, @needles;

open FILE, '<', $ARGV[ 0 ] or die "$ARGV[ 0 ]: $!";

my( $soFar, $offset ) = ( 0, 0 );
while( my $read = sysread FILE, $_, $BUFSIZE, $offset ) {
    while( m[$regex]g ) {
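        # Index 0 of the buffer sits at file offset $soFar - $offset,
        # so allow for the bytes carried over from the previous read.
        printf "(%d): '%s'\n", pos() + $soFar - $offset, substr $_, $-[0], $+[0] - $-[0];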
    }
    substr $_, 0, $maxLen, substr $_, -$maxLen;
    $soFar += $read;
    $offset = $maxLen;
}

The output looks like: (28749820): '345678', that is, the byte offset in the file (where the match ends, since pos() is used), followed by the string matched.

The basic principles are:

to use a largish read size, to minimise the number of times you hit the disk and start the regex engine.
Finding the optimum BUFSIZE for your system takes a little experimentation. Larger is not always faster.
to perform the sliding buffer manipulations and the read 'in-place', overlaying the same buffer, to minimise the work done by the GC.
The manipulations with $maxLen are there to ensure that a potential match that crosses a read boundary will still be matched. Basically, each pass retains as many characters from the end of the preceding read as are required to match the longest needle, and appends the new read after them.
That math could be enhanced by reducing the read size by the length of the residue retained, so that the buffer never grows beyond $BUFSIZE.
to build a single alternation regex from all the needles.
This will work better under 5.10, which optimises such alternations into a trie, but be aware that there are limits. From memory, more than a few thousand search strings will cause 5.10 to abandon the trie optimisation.
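
For what it's worth, here is a rough sketch of how the "enhanced math" might look. It is not the code from the post: search_file() and its variable names are made up for illustration. It assumes you want the offset of the start of each match rather than pos(), that carrying one byte less than the longest needle is enough to complete anything that straddles a boundary, and that a match lying wholly inside the carried bytes can be skipped because the previous pass already saw it. Needles that overlap one another right at a boundary may be reported a little differently than a single whole-file scan would report them.

```perl
#! perl
use strict;
use warnings;
use List::Util qw[ max min ];

# Hypothetical helper, not from the original post: scan the file at $path
# for any of @$needles, never holding more than $bufsize bytes in the buffer.
# Returns a list of [ start_offset, matched_string ] pairs.
sub search_file {
    my( $path, $needles, $bufsize ) = @_;
    $bufsize ||= 2**20;

    my $alt    = join '|', map quotemeta, @$needles;
    my $regex  = qr/(?:$alt)/;
    my $maxLen = max map length, @$needles;
    die "BUFSIZE must be at least $maxLen\n" if $bufsize < $maxLen;

    open my $fh, '<:raw', $path or die "$path: $!";

    my @hits;
    my $buf     = '';
    my $carried = 0;    # bytes at the front of $buf retained from the last pass
    my $base    = 0;    # file offset of the first byte of $buf

    while( 1 ) {
        # Shrink the read by the residue so $buf never outgrows $bufsize.
        my $read = sysread $fh, $buf, $bufsize - $carried, $carried;
        die "$path: $!" unless defined $read;
        last unless $read;

        while( $buf =~ /$regex/g ) {
            # A match wholly inside the carried bytes was seen on the last pass.
            next if $+[0] <= $carried;
            push @hits, [ $base + $-[0], substr( $buf, $-[0], $+[0] - $-[0] ) ];
        }

        # Keep one byte less than the longest needle (or the whole buffer
        # if it is shorter than that) and slide it to the front, in place.
        my $keep = min( $maxLen - 1, length $buf );
        $base   += length( $buf ) - $keep;
        substr( $buf, 0, $keep, substr( $buf, length( $buf ) - $keep ) );
        $carried = $keep;
    }
    close $fh;
    return @hits;
}

# Example use, with the needle list from the post:
if( @ARGV ) {
    my @needles = qw[ 2228809700 123456 234567 345678 456789 1234567890 ];
    printf "(%d): '%s'\n", @$_ for search_file( $ARGV[ 0 ], \@needles );
}
```

Passing a deliberately tiny third argument (say 16) is a cheap way to exercise the boundary handling on a small test file before trusting it on a big one.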
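
One thing to watch when building the alternation: Perl tries the alternatives in the order they are listed, so with the needle list above '123456' wins wherever '1234567890' occurs, and the longer needle is never reported. A small sketch of one way around that, sorting the needles longest-first before joining them; build_regex() is a made-up helper for illustration, not something from the post:

```perl
#! perl
use strict;
use warnings;

# Needle list from the post; note that '123456' is a prefix of '1234567890'.
my @needles = qw[ 2228809700 123456 234567 345678 456789 1234567890 ];

# Hypothetical helper, not from the original post: build one alternation,
# optionally sorting the needles longest-first.
sub build_regex {
    my( $longestFirst, @list ) = @_;
    @list = sort { length( $b ) <=> length( $a ) } @list if $longestFirst;
    my $alt = join '|', map quotemeta, @list;
    return qr/(?:$alt)/;
}

my $asListed     = build_regex( 0, @needles );
my $longestFirst = build_regex( 1, @needles );

my $haystack = 'xx1234567890xx';
my( $first )  = $haystack =~ /($asListed)/;
my( $second ) = $haystack =~ /($longestFirst)/;

print "as listed:     $first\n";     # 123456
print "longest first: $second\n";    # 1234567890
```

Whether you want the shorter or the longer needle reported is up to you; the point is only that the order of the list decides it.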

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

In reply to Re: Fastest Search method for strings in large file by BrowserUk
in thread Fastest Search method for strings in large file by Anonymous Monk
