Re: two hashes occupying the same space

What you're asking for is a way to gain easy lookups, where the keys of hash1 refer to the same values that are the keys of hash2, and vice versa.

I can see the logic here. Hash lookups are O(1) in speed efficiency, whereas greping for a value is O(n). But the problem is that maintaing a pair of hashes that crossreference each other gains speed at the cost of memory.

When you find yourself in such a situation, it is often time to start thinking in terms of a relational database. Even if all you have are two columns, it makes sense to let the database engine deal with quickly locating your information for you. Let's say you have the following database table:

  name  |  title
--------|----------------
 bart   | troublemaker
 homer  | delinquent dad
 marge  | enabling mother
 maggie | wise baby
 lisa   | child prodigy
[download]

You could perform lookups by name by using an SQL query such as this one:

SELECT title FROM jobs WHERE name='bart'

Or you could let the database search based on the other column instead...

SELECT name FROM jobs WHERE title='troublemaker'

The database engine is usually pretty good at doing lookups in very rapid time. Perl makes database work easy (or at least possible) using the DBI module. DBI is the Perl-side interface to your database. It still needs an actual database too. A very simple, easy to use, easy to install, single-file kind of database will do, and for that, DBD::SQLite is a great tool.

I hope this alternate approach is helpful.

Dave

Comment on Re: two hashes occupying the same space Select or Download Code

Replies are listed 'Best First'.
Re^2: two hashes occupying the same space by tilly (Archbishop) on Oct 28, 2004 at 05:26 UTC
Databases aren't magic. For the database to be fast at both operations, it needs to have indexes on both fields. At which point you have something similar to hashes pointing both ways. The only real saving with a relational database is that its internal data format is likely more compact than Perl's, so the data will take less memory.	[reply]
Re^3: two hashes occupying the same space by davido (Cardinal) on Oct 28, 2004 at 05:30 UTC
But indexing both fields doesn't mean that the indexes have to be held in memory, in as memory-consuming of a way as holding two hashes in memory. Double-indexed databases do scale well, while paired hashes don't. Dave	[reply]
Re^4: two hashes occupying the same space by tilly (Archbishop) on Oct 28, 2004 at 05:35 UTC
That is a good point. However as soon as the indexes are not entirely held in memory you get a performance dropoff. And the relational database itself already is a performance decrease over a straight hash. TANSTAAFL. Pick the trade-off which is most appropriate for what you're doing.	[reply]
Re^5: two hashes occupying the same space by davido (Cardinal) on Oct 28, 2004 at 05:45 UTC
Re^2: two hashes occupying the same space by BrowserUk (Patriarch) on Oct 28, 2004 at 05:36 UTC
Simple suggestion: Implement your proposal and then compare the time taken to perform 80 million read/modify/write cycles across 300,000 kv pairs, with doing the same using a perl hash. Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon	[reply]
Re^3: two hashes occupying the same space by davido (Cardinal) on Oct 28, 2004 at 06:00 UTC
I don't need to test to know that the database solution will be slower than the hash solution as long as the hash fits in conventional memory, and I think you know the answer to that too. But at some point, 300_000 k/v pairs, 500_000, or maybe a million, the hash is going to cease to be a plausable option due to memory constraints. Given that the OP is asking about trying to get tandem lookup hashes to occupy the same memory space is a pretty good indication that (s)he is running into memory problems. One avenue that folks often take when they find that they just can't hold the whole dataset in memory is to turn to databases. Them's the breaks. Tough decisions will have to be made. Either live with the inefficiency of not having the tandem lookup hash capability, or live with the inefficiency of the overhead of a database. If holding the dataset in memory once is ok, but twice is impossible, then you can't have a tandem pair of lookup hashes. If memory isn't an issue, then the original question is pointless. But there was a point stated in the question: "What I'm trying to do is reduce the memory requirement." In retrospect, I wish I hadn't offered the database suggestion. Not because it's not a reasonable comprimise (It is reasonable) , but because the OP undoubtedly already knows his options with regards to databases, being a 'Team Sybase Member'. Dave	[reply]