comment on

If, as your sample data suggests, the key numbers are strictly ascending, though not sequential, and your records are fixed length (or can easily be made so with a fast one-pass one-liner), then you can find your values very quickly (O(1) time and space!), by calculating the approximate position and then seeking forward or backward from that starting point.

The following crude demo code:

#! perl -sw
use 5.010;
use strict;
use Time::HiRes qw[ time ];

my $start = time;

my $recLen = 43;

open I, '<', '768941.dat' or die $!;
my $first = ( split ' ', scalar <I> )[ 0 ];

seek I, -$recLen, 2; ## SEEK_END
my $last  = ( split ' ', scalar <I> )[ 0 ];
say "$first - $last";

my $recs = tell( I ) / $recLen;
my $factor = $recs / ( $last - $first );

my $success = 0;
while( my $target = <> ) {
    chomp $target;

    warn "Input outside range"
        if $target > $last or $target < $first;

    my $offset = int( ( $target - $first ) * $factor ) * $recLen;

    my $rec;
    seek I, $offset, 0 or die "Seek failed\n";
    my $found = ( split ' ', $rec = <I> )[ 0 ];

    if( $found != $target ) {
        my $direction = $found < $target;
        $offset =  $direction ? $recLen : ( -$recLen * 2 );

        seek I, $offset, 1 or die "Seek failed" ## SEEK_CUR
            while ( $found = ( split ' ', $rec = <I> )[ 0 ] )
            and $direction ? ( $found < $target ) : ( $found > $target
+ );
    }
    if( $found == $target ) {
        print "Found: $rec";
        ++$success;
    }
    else {
        warn "Target $target not found\n";
    }
}
close I;

printf STDERR "Found $success records in %.3f seconds\n", time() - $st
+art;
[download]

Looks up 1000 (existing) records:

c:\test>wc -l 768941.search
   1000 768941.search

c:\test>head 768941.search
1101077781160
1101077781161
1101077781162
1101077781163
1101077781164
1101077781165
1101077781166
1101077781167
1101077781168
1101077781169
[download]

in an approximation of your data file:

c:\test>dir 768941.dat
06/06/2009  03:55     1,625,416,082 768941.dat

c:\test>head 768941.dat
1101077781160   1101077783656   bothDegen
1101077781161   1101077783657   bothChaff
1101077781162   1101077783658   bothDegen
1101077781163   1101077783659   bothChaff
1101077781164   1101077783660    oneChaff
1101077781165   1101077783661   bothChaff
1101077781166   1101077783662    oneDegen
1101077781167   1101077783663   bothDegen
1101077781168   1101077783664   bothChaff
1101077781169   1101077783665   bothDegen

c:\test>wc -l 768941.dat
37800374 768941.dat
[download]

In two or three hundreths of a second:

c:\test>768941 768941.search >nul
Found 1000 records in 0.02 seconds

c:\test>768941 768941.search >nul
Found 1000 records in 0.03 seconds

c:\test>768941 768941.search >nul
Found 1000 records in 0.03 seconds
[download]

But that is with a warm cache. It took a whole 1/10th of a second the first time :) That's faster than you could make a connection to most databases.

The trouble with people who have sledgehammers, everything looks like a tough nut.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

In reply to Re: Rapid text searches ( O(1) space and time) by BrowserUk
in thread Rapid text searches by joomanji

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.