in reply to Rapid text searches
If, as your sample data suggests, the key numbers are strictly ascending, though not sequential, and your records are fixed length (or can easily be made so with a fast one-pass one-liner), then you can find your values very quickly (O(1) time and space!), by calculating the approximate position and then seeking forward or backward from that starting point.
The following crude demo code:
#! perl -sw use 5.010; use strict; use Time::HiRes qw[ time ]; my $start = time; my $recLen = 43; open I, '<', '768941.dat' or die $!; my $first = ( split ' ', scalar <I> )[ 0 ]; seek I, -$recLen, 2; ## SEEK_END my $last = ( split ' ', scalar <I> )[ 0 ]; say "$first - $last"; my $recs = tell( I ) / $recLen; my $factor = $recs / ( $last - $first ); my $success = 0; while( my $target = <> ) { chomp $target; warn "Input outside range" if $target > $last or $target < $first; my $offset = int( ( $target - $first ) * $factor ) * $recLen; my $rec; seek I, $offset, 0 or die "Seek failed\n"; my $found = ( split ' ', $rec = <I> )[ 0 ]; if( $found != $target ) { my $direction = $found < $target; $offset = $direction ? $recLen : ( -$recLen * 2 ); seek I, $offset, 1 or die "Seek failed" ## SEEK_CUR while ( $found = ( split ' ', $rec = <I> )[ 0 ] ) and $direction ? ( $found < $target ) : ( $found > $target + ); } if( $found == $target ) { print "Found: $rec"; ++$success; } else { warn "Target $target not found\n"; } } close I; printf STDERR "Found $success records in %.3f seconds\n", time() - $st +art;
Looks up 1000 (existing) records:
c:\test>wc -l 768941.search 1000 768941.search c:\test>head 768941.search 1101077781160 1101077781161 1101077781162 1101077781163 1101077781164 1101077781165 1101077781166 1101077781167 1101077781168 1101077781169
in an approximation of your data file:
c:\test>dir 768941.dat 06/06/2009 03:55 1,625,416,082 768941.dat c:\test>head 768941.dat 1101077781160 1101077783656 bothDegen 1101077781161 1101077783657 bothChaff 1101077781162 1101077783658 bothDegen 1101077781163 1101077783659 bothChaff 1101077781164 1101077783660 oneChaff 1101077781165 1101077783661 bothChaff 1101077781166 1101077783662 oneDegen 1101077781167 1101077783663 bothDegen 1101077781168 1101077783664 bothChaff 1101077781169 1101077783665 bothDegen c:\test>wc -l 768941.dat 37800374 768941.dat
In two or three hundreths of a second:
c:\test>768941 768941.search >nul Found 1000 records in 0.02 seconds c:\test>768941 768941.search >nul Found 1000 records in 0.03 seconds c:\test>768941 768941.search >nul Found 1000 records in 0.03 seconds
But that is with a warm cache. It took a whole 1/10th of a second the first time :) That's faster than you could make a connection to most databases.
The trouble with people who have sledgehammers, everything looks like a tough nut.
|
|---|