If, as your sample data suggests, the key numbers are strictly ascending, though not sequential, and your records are fixed length (or can easily be made so with a fast one-pass one-liner), then you can find your values very quickly (O(1) time and space!), by calculating the approximate position and then seeking forward or backward from that starting point.

The following crude demo code:

#! perl -sw use 5.010; use strict; use Time::HiRes qw[ time ]; my $start = time; my $recLen = 43; open I, '<', '768941.dat' or die $!; my $first = ( split ' ', scalar <I> )[ 0 ]; seek I, -$recLen, 2; ## SEEK_END my $last = ( split ' ', scalar <I> )[ 0 ]; say "$first - $last"; my $recs = tell( I ) / $recLen; my $factor = $recs / ( $last - $first ); my $success = 0; while( my $target = <> ) { chomp $target; warn "Input outside range" if $target > $last or $target < $first; my $offset = int( ( $target - $first ) * $factor ) * $recLen; my $rec; seek I, $offset, 0 or die "Seek failed\n"; my $found = ( split ' ', $rec = <I> )[ 0 ]; if( $found != $target ) { my $direction = $found < $target; $offset = $direction ? $recLen : ( -$recLen * 2 ); seek I, $offset, 1 or die "Seek failed" ## SEEK_CUR while ( $found = ( split ' ', $rec = <I> )[ 0 ] ) and $direction ? ( $found < $target ) : ( $found > $target + ); } if( $found == $target ) { print "Found: $rec"; ++$success; } else { warn "Target $target not found\n"; } } close I; printf STDERR "Found $success records in %.3f seconds\n", time() - $st +art;

Looks up 1000 (existing) records:

c:\test>wc -l 768941.search 1000 768941.search c:\test>head 768941.search 1101077781160 1101077781161 1101077781162 1101077781163 1101077781164 1101077781165 1101077781166 1101077781167 1101077781168 1101077781169

in an approximation of your data file:

c:\test>dir 768941.dat 06/06/2009 03:55 1,625,416,082 768941.dat c:\test>head 768941.dat 1101077781160 1101077783656 bothDegen 1101077781161 1101077783657 bothChaff 1101077781162 1101077783658 bothDegen 1101077781163 1101077783659 bothChaff 1101077781164 1101077783660 oneChaff 1101077781165 1101077783661 bothChaff 1101077781166 1101077783662 oneDegen 1101077781167 1101077783663 bothDegen 1101077781168 1101077783664 bothChaff 1101077781169 1101077783665 bothDegen c:\test>wc -l 768941.dat 37800374 768941.dat

In two or three hundreths of a second:

c:\test>768941 768941.search >nul Found 1000 records in 0.02 seconds c:\test>768941 768941.search >nul Found 1000 records in 0.03 seconds c:\test>768941 768941.search >nul Found 1000 records in 0.03 seconds

But that is with a warm cache. It took a whole 1/10th of a second the first time :) That's faster than you could make a connection to most databases.

The trouble with people who have sledgehammers, everything looks like a tough nut.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

In reply to Re: Rapid text searches ( O(1) space and time) by BrowserUk
in thread Rapid text searches by joomanji

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.