in reply to fast lookups in files

  1. How many lines in your file?
  2. Over what range are the keys and values spread?
  3. How many lookups per load?

    E.g., is this a web app that does half a dozen lookups per load, or a system app that does thousands per load?

  4. Is concurrency a requirement?
  5. How often does the dataset change?
  6. How fast a lookup are you hoping for?

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^2: fast lookups in files
by citromatik (Curate) on Feb 05, 2008 at 14:42 UTC
    How many lines in your file?

    18885025

    Over what range are the keys and values spread?

    keys from 6 to 164664533
    values from 0 to 494514

    How many lookups per load?

    This is a stand-alone system application that does up to thousands of lookups per run (although it could be called for just one lookup). This is an intermediate step in the app; the other steps are well optimized for performance.

    Is concurrency a requirement?

    In principle, no

    How often does the dataset change?

    More or less once a month

    How fast a lookup are you hoping for?

    As fast as I can get. As I said before, the rest of the script is optimized for speed and memory usage, and it is a bit frustrating to see how the other parts of the program (which seem to be much more complicated) run very fast while these lookups slow down the overall process so much.

    Thanks,

    citromatik

      Can you afford 108 MB of ram?

      If so, rewrite your file in binary, packing each KV pair using 'NS'. 18885025 * 6 / 2**20 = 108 MB.
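      A minimal conversion sketch (untested, and assuming one whitespace-separated key/value pair per line, already sorted numerically by key -- the binary chop below depends on that; 'dataset.bin' is just a placeholder name):

      use strict;
      use warnings;

      open my $in,  '<',     'dataset.txt' or die $!;
      open my $out, '>:raw', 'dataset.bin' or die $!;
      while ( <$in> ) {
          my( $key, $val ) = split ' ';
          ## Caveat: 'S' is an unsigned 16-bit slot; values above 65535 (you
          ## report values up to 494514) would be silently truncated -- 'NL'
          ## (8 bytes per pair) would avoid that at the cost of more RAM.
          print $out pack 'NS', $key, $val;
      }
      close $out or die $!;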

      Slurp the entire file into a single string and then use a binary chop on it, something along the lines of:

      open my $fh, '<:raw', $datafile or die $!;
      my $data;
      sysread( $fh, $data, -s $datafile ) or die $!;
      close $fh;

      sub lookup {
          my $target = shift;
          ## Inclusive binary chop over the fixed-width 6-byte records.
          my( $left, $right ) = ( 0, length( $data ) / 6 - 1 );
          while( $left <= $right ) {
              my $mid = int( ( $left + $right ) / 2 );
              my( $key, $val ) = unpack 'NS', substr $data, $mid * 6, 6;
              if(    $key < $target ) { $left  = $mid + 1; }
              elsif( $key > $target ) { $right = $mid - 1; }
              else                    { return $val; }
          }
          return;    ## target not present
      }
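      A (hypothetical) call, using your largest key:

      my $val = lookup( 164664533 );
      print defined $val ? "$val\n" : "key not found\n";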

      In a quick test this achieved 12,500 lookups per second. (Let's see them do that with an RDBMS :)

      Notes: I would not be surprised if the above contains bugs; it was thrown together quickly. My test data was only 10e6 lines, so expect yours to run a little slower.



        First of all, thanks a lot (and to the others who replied, too)

        Can you afford 108 MB of ram?

        Sure!

        rewrite your file in binary, packing each KV pair using 'NS'.

        Well, I'm not sure if I did it right:

        perl -i.bak -lane 'print pack "NS",@F' dataset.txt

        But when applying the binary search, I'm getting weird keys and values (i.e., keys out of range, etc.)

        citromatik

        Update: Also, the resulting binary file is 127 MB (132195175 bytes), not 108. Maybe the original file has errors (?). I will check.
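        For what it's worth, 132195175 / 18885025 = 7 exactly, so each record is 7 bytes rather than the expected 6 -- which looks like the newline that -l appends to every print. Dropping -l might be the fix (untested; the -a autosplit's split ' ' discards the trailing newline anyway, so the keys and values should be unaffected):

        perl -i.bak -ane 'print pack "NS",@F' dataset.txt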