in reply to Re^2: fast lookups in files
in thread fast lookups in files

Can you afford 108 MB of RAM?

If so, rewrite your file in binary, packing each KV pair using 'NS' (a 4-byte unsigned long for the key plus a 2-byte unsigned short for the value, 6 bytes per record): 18885025 * 6 / 2**20 = 108 MB.
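
For the conversion itself, something along these lines would do it. This is only a sketch: it assumes dataset.txt holds one whitespace-separated key/value pair per line, already sorted numerically by key (which the binary chop below relies on), and that keys fit in 32 bits and values in 16 bits; the filenames are placeholders:

#! perl -w
use strict;

# Pack each "key value" text line into a fixed 6-byte 'NS' record.
open my $in,  '<',     'dataset.txt' or die $!;
open my $out, '>:raw', 'dataset.bin' or die $!;

while( <$in> ) {
    my( $key, $val ) = split;              # whitespace-separated pair
    print {$out} pack 'NS', $key, $val;    # 4 + 2 bytes, no newline
}

close $out or die $!;
close $in;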

Slurp the entire file into a single string and then do a binary chop on it, something along the lines of:

open DATA, '<:raw', $datafile or die $!;
my $data;
sysread( DATA, $data, -s( $datafile ) ) or die $!;
close DATA;

sub lookup {
    my $target = shift;
    my( $left, $right ) = ( 0, length( $data ) / 6 );
    while( $left < $right ) {
        my $mid = int( ( $left + $right ) / 2 );
        my( $key, $val ) = unpack 'NS', substr $data, $mid * 6, 6;
        if(    $key < $target )  { $left  = $mid + 1; }
        elsif( $key > $target )  { $right = $mid - 1; }
        elsif( $key == $target ) { return $val; }
        else                     { return; }
    }
}
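
Called with a numeric key, lookup() returns the stored value on a hit; on a miss the loop falls through and the sub returns nothing. Usage would look something like this (the key 12345 is just an example):

my $val = lookup( 12345 );
print defined $val ? "value: $val" : "not found";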

In a quick test this achieved 12,500 lookups per second. (Let's see them do that with an RDBMS :)

Notes: I would not be surprised if the above contains bugs; it was thrown together quickly. My test data was only 10e6 lines, so expect lookups on your full dataset to be a little slower.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^4: fast lookups in files
by citromatik (Curate) on Feb 05, 2008 at 15:48 UTC

    First of all, thanks a lot (and thanks to the others who replied too).

    Can you afford 108 MB of RAM?

    Sure!

    rewrite your file in binary, packing each KV pair using 'NS'.

    Well, I'm not sure if I did it right:

    perl -i.bak -lane 'print pack "NS",@F' dataset.txt

    But when applying the binary search, I'm getting weird keys and values (e.g. keys out of range).

    citromatik

    Update: Also, the resulting binary file is 127 MB (132195175 bytes), not 108 MB. Maybe the original file has errors (?). I will check.
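
    A possible explanation for that size, just from the arithmetic: 132195175 bytes is exactly 18885025 * 7, i.e. 7 bytes per record instead of 6, which is what the -l switch would produce, since it appends a newline to every print. Dropping -l and forcing raw output should give 6-byte records; for example, writing to a separate file (dataset.bin here is just a placeholder name):

    perl -ane 'BEGIN{ binmode STDOUT } print pack "NS", @F' dataset.txt > dataset.bin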

      Also, there were a couple of bugs in my binary chop, as I warned there might be. Anyway, here is a corrected and faster version. Run against a file containing 20e6 values (from 5 to 100e6 in steps of 5), randomly picking lookup keys of which approximately half should be misses, it achieves ~20,000 lookups/second:

      #! perl -slw
      use strict;
      use Math::Random::MT qw[ rand ];
      use Benchmark::Timer;

      my $T = new Benchmark::Timer;

      open IN, '<:raw', '666269.bin' or die $!;
      my $data;
      sysread IN, $data, -s( '666269.bin' ) or die $!;
      close IN;

      sub lookup {
          my $target = pack 'N', shift;
          my( $left, $right ) = ( 0, length( $data ) / 6 );
          while( $left < $right ) {
              my $mid = int( ( $left + $right ) / 2 );
              my $key = substr $data, $mid * 6, 4;
              if(    $key lt $target ) { $left  = $mid + 1; }
              elsif( $key gt $target ) { $right = $mid; }
              elsif( $key eq $target ) {
                  my( $key, $val ) = unpack 'NS', substr $data, $mid * 6, 6;
                  return $val;
              }
          }
      }

      my $found = 0;
      my $n     = $ARGV[ 0 ] || 1000;
      my $label = "Lookup $n values";

      $T->start( $label );
      for ( 1 .. $n ) {
          $found++ if lookup( 5 * int( rand 19e6 ) + int( rand 2 ) );
      }
      $T->stop( $label );

      $T->report;
      print "found: $found";

      __END__
      C:\test>666269 1e6
      1 trial of Lookup 1e6 values (53.287s total)
      found: 500062

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.