in reply to Speeding a disk based hash

Memory access time is measured in nanoseconds. Disk access time is measured in milliseconds. You will inevitably get a slowdown measured in orders of magnitude when you tie your hash to disk.

There are 3 basic approaches:

  1. Buy more memory - probably cheaper than recoding - and leave your hash in memory.
  2. Improve your algorithm to be more efficient. Do you really need (all) that hash?
  3. Move the data to an RDBMS which is designed to efficiently index and manipulate large quantities of data.

Which approach works best for you depends on what it is you (think) you need the hash for. If you provide more details of the precise problem you are trying to solve then you will probably get a few useful algorithmic suggestions. Why is loading time an issue - is this a dynamic CGI/Tk type app? How much memory have you got? OS? Version of Perl?

cheers

tachyon

Re^2: Speeding a disk based hash
by albert (Monk) on Oct 11, 2004 at 02:45 UTC
      1. Memory is not limiting on the box, as the memory limit I am hitting is an internal Perl one. So, I need to overcome the Perl limit.
      2. I am importing 20M+ records, and need to track the IDs of the records I import. But your point is well taken, and I'll look to see how I might modify how I am importing things.
      3. This import is just an intermediary as I am importing into a RDBMS. I'm just trying to import large datasets as quickly as possible.

    From your comments on access time, it doesn't sound as though I can greatly speed things with this disk-based approach. This is what I suspected, though I had hoped to find wisdom pointing me to another solution.

    Thanks.

      Well, despite my asking, you still don't really supply much useful detail. Presumably something like an "Out of memory" error at around 950MB, with 14GB of RAM free on the box, is about right, given you say you have plenty of memory.

      Now, this is total speculation, but you call it an intermediate step, which makes me think you are either doing a merge or a filter based on the content of the hash. Either case can be handled with a merge sort strategy. If the data in your hash is stored in a flat file sorted by hash key, and the data it is to be merged with (or filtered against) is similarly sorted, then you can very efficiently make a single pass through both files in lockstep, generating a final output file. The basic algorithm is: open both files and read a line from each. If the keys are the same, do the merge and output; if not, read another line from the file whose key is less than the other file's key. You thus rapidly walk both files and find all matching keys.
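      Here is a minimal sketch of that lockstep merge, assuming both files are tab-separated text already sorted on their first field (the file names and the "merge" step are placeholders for whatever your intermediate step actually does):

          #!/usr/bin/perl
          use strict;
          use warnings;

          # hash.txt and data.txt are assumed sorted by their first
          # (tab-separated) field; merged.txt receives the joined lines.
          open my $hash_fh, '<', 'hash.txt'   or die "hash.txt: $!";
          open my $data_fh, '<', 'data.txt'   or die "data.txt: $!";
          open my $out_fh,  '>', 'merged.txt' or die "merged.txt: $!";

          my $hash_line = <$hash_fh>;
          my $data_line = <$data_fh>;

          while ( defined $hash_line and defined $data_line ) {
              chomp $hash_line;
              chomp $data_line;
              my ($hash_key) = split /\t/, $hash_line;
              my ($data_key) = split /\t/, $data_line;

              if ( $hash_key eq $data_key ) {
                  # keys match - merge the two records and advance both files
                  print $out_fh "$hash_line\t$data_line\n";
                  $hash_line = <$hash_fh>;
                  $data_line = <$data_fh>;
              }
              elsif ( $hash_key lt $data_key ) {
                  # advance whichever file currently has the smaller key
                  $hash_line = <$hash_fh>;
              }
              else {
                  $data_line = <$data_fh>;
              }
          }

          close $_ for $hash_fh, $data_fh, $out_fh;

      If the files are not already sorted, a command-line sort on the key field (or a chunked sort in Perl) gets them into shape first; after that the merge never holds more than one line from each file in memory.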

      cheers

      tachyon

      How about using another DB to hold this hash? Just create a table with a simple DB like SQLite, and see if you can gain some speed that way.
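      A minimal sketch of that idea with DBI and DBD::SQLite (both need to be installed from CPAN; the database file, table and column names below are just placeholders):

          use strict;
          use warnings;
          use DBI;

          # AutoCommit => 0 so many inserts share one transaction,
          # which is what keeps SQLite fast for bulk loads.
          my $dbh = DBI->connect( 'dbi:SQLite:dbname=seen_ids.db', '', '',
              { RaiseError => 1, AutoCommit => 0 } );

          $dbh->do('CREATE TABLE IF NOT EXISTS seen (id TEXT PRIMARY KEY)');

          my $ins = $dbh->prepare('INSERT OR IGNORE INTO seen (id) VALUES (?)');
          my $chk = $dbh->prepare('SELECT 1 FROM seen WHERE id = ?');

          my $id = 'ABC123';    # placeholder record ID

          # instead of $seen{$id} = 1
          $ins->execute($id);

          # instead of exists $seen{$id}
          $chk->execute($id);
          my $already_seen = $chk->fetchrow_array;

          $dbh->commit;
          $dbh->disconnect;

      The transaction handling is the important part: committing after every insert will crawl, while batching thousands of inserts per commit keeps SQLite reasonably quick.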

      Graciliano M. P.
      "Creativity is the expression of the liberty".