Speeding a disk based hash

albert has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Speeding a disk based hash by tachyon (Chancellor) on Oct 11, 2004 at 00:34 UTC
Memory access time is measured in nanoseconds. Disk access time is measured in milliseconds. You will inevitably get a slowdown measured in orders of magnitude when you tie to disk. There are 3 basic approaches: Buy more memory - probably cheaper than recoding - and leave your hash in memory. Improve your algorithm to be more efficient. Do you really need (all) that hash? Move the data to an RDBMS which is designed to efficiently index and manipulate large quantities of data. Which approach works best for you depends on what it is you (think) you need the hash for. If you provide more details of the precise problem you are trying to solve then you will probably get a few useful algorithmic suggestions. Why is loading time an issue - is this a dynamic CGI/Tk type app? How much memory have you got? OS? Version of Perl? cheers tachyon	[reply]
Re^2: Speeding a disk based hash by albert (Monk) on Oct 11, 2004 at 02:45 UTC
1. Memory is not limiting on the box, as the memory limit I am hitting is an internal Perl one. So, I need to overcome the Perl limit. 2. I am importing 20M+ records, and need to track the IDs of the records I import. But your point is well taken, and I'll look to see how I might modify how I am importing things. 3. This import is just an intermediary as I am importing into a RDBMS. I'm just trying to import large datasets as quickly as possible. From your comments on access time, it doesn't sound as though I can greatly speed things with this disk-based approach. This is what I suspected, though I had hoped to find wisdom pointing me to another solution. Thanks.	[reply]
Re^3: Speeding a disk based hash by tachyon (Chancellor) on Oct 11, 2004 at 03:19 UTC
Well despite my asking you still don't really supply useful detail. Probably Out Of Memory error at 950MB with 14GB free RAM is appropriate given you say you have a lot of memory. Now this is total speculation but you call it an intermediate step. This makes me think you are either doing a merge or a filter based on the content of the hash. Either case can be dealt with by using a merge sort strategy. If the data in your hash is stored in a sorted flat file (sorted by hash key) and the data it is to be merged with/filtered against is similarly sorted then you can very efficiently make a single pass theough both files in lockstep, generating a final output file. The basic algorithm is to open both files, and read a line from each. If the keys are the same do a merge and output, if not read another line from the file where the key < other_file_key. Thus you rapidly walk both files and find all mathcing keys. cheers tachyon	[reply]
Re^3: Speeding a disk based hash by gmpassos (Priest) on Oct 11, 2004 at 02:58 UTC
How about use another DB to hold this HASH? Just create a table with a simple DB like SQLite, and see if you can win some speed with that. Graciliano M. P. "Creativity is the expression of the liberty".	[reply]
Re: Speeding a disk based hash by Zaxo (Archbishop) on Oct 11, 2004 at 00:30 UTC
What kind of database are you originally loading from? It sounds like you are slurping some kind of flat file and hashing for searches. It might be better to populate the BerkeleyDB once and keep it around as the primary database. Other databases like SQLite might be preferable. A sharper picture of what you're doing would help. After Compline, Zaxo	[reply]
Re^2: Speeding a disk based hash by albert (Monk) on Oct 11, 2004 at 02:37 UTC
I am trying to load data from flat files into Postgres. The script is a a bulk loader which generates individual text files for COPY loading of each table, and indexes are only created once the tables are loaded into Postgres. In the script, I need to keep track of the IDs as they are generated. Without using the disk based hash, this is a nice way to greatly speed import from a straight load into a schema with indexes, etc. I'm only using BerkeleyDB to track the hash while building the input tables.	[reply]
Re^3: Speeding a disk based hash by BrowserUk (Patriarch) on Oct 11, 2004 at 03:11 UTC
If speed is the issue, avoiding the disk is your best technique. And if the issue with that is that you're using too much memory for your hash, then their are ways to build a datastructure that has some subset of the properties of a hash, without requiring the memory requirements of a hash. Whether these are useful in your situation will depend on which properties if a hash you need for your application, and which you do not? Your description says "I need to keep track of the IDs as they are generated...", which suggests you are using the hash to do 'fast lookups'. If this is the only property you need of the hash, then you might be able to use the methods I described in A (memory) poor man's <strike>hash</strike> lookup table.. It's still slower than a hash, but much faster than a disk-based one, and depending upon the size and makeup of the keys can use as little as 15% of the memory required by a real hash. More information on how you're using the hash, how many and what type & range of keys, is needed to decide if it is applicable to your requirements. Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon	[reply]
Re^3: Speeding a disk based hash by revdiablo (Prior) on Oct 11, 2004 at 16:42 UTC
Another aproach that may or may not work, depending on your exact requirements, is loading the keys into memory in blocks. If you're going through the indexes sequentially, then it may make sense to first load, say, 1000 from disk, then use those from memory. When you are done with that block, purge it and load the next 1000. This way, you spread your disk access out over time, with periods of pure memory access in between. I suspect this might not work in your situation, but, as everyone else keeps saying, you haven't really provided enough information for us to know for sure. Hopefully by throwing random ideas out there, we'll come up with one that works. :-)	[reply]
Re: Speeding a disk based hash by tilly (Archbishop) on Oct 11, 2004 at 16:42 UTC
In many situations, a BTree can be much faster than hashes. BerkeleyDB supports them as well. Also if you're exceeding Perl's memory requirements by just a bit, you can tie a hash to an in-memory database. BerkeleyDB should be more memory-efficient than Perl and so it might fit in RAM when Perl's native data structures do not. Beyond that, at Re: size on disk of tied hashes I gave an explanation of some of the performance problems with dealing with disk on large datasets, and briefly discussed some of the options that are available. Note that if you care to benchmark your application, you do not want to benchmark it with random data. Do it with a sample of real data. Disk performance is strongly affected by your access pattern, and real world access patterns are not very random (else caching would not be a good idea).	[reply]
Re: Speeding a disk based hash by diotalevi (Canon) on Oct 11, 2004 at 18:20 UTC
Please see BerkeleyDB vs. Linux file system. If you are using BerkeleyDB, make sure you tune that cache size! Use the db_stat utility and this document. -- perrin	[reply]