jfroebe has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm pretty sure the answer is no, but I want to double-check... I have two hashes that contain the same data, but with the key and the value swapped:

$hash1{$value1} = $value2;
$hash2{$value2} = $value1;

My question is whether or not I can somehow get the two hashes to occupy the exact same space... only hash2 would look at the memory backwards. Considering there are about 300,000 key/value pairs, this is a significant amount of data, and no, putting it in a database is not viable, as we are talking about roughly 80 million operations on each hash. What I'm trying to do is reduce the memory requirement.

Again, I'm doubtful this can be done, but it doesn't hurt to ask.

Jason L. Froebe

Team Sybase member

No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1

Replies are listed 'Best First'.
Re: two hashes occupying the same space
by BrowserUk (Patriarch) on Oct 28, 2004 at 04:17 UTC

    There is no direct way, though it might be possible to share some of the memory by dropping to XS. I have read something, somewhere, about hash key storage sometimes being shared: where the same SV is a key in more than one hash, Perl will (sometimes?) reuse the same copy for both hashes.

    I can't actually remember where I read about this sharing--I'll update if I find it. In any case, I don't think that applies here as the keys of one would be the values of the other and vice versa.

    A quick test shows that two hashes with 300,000 key/value pairs ("knnnn"/"vnnnn") require around 90 MB in total, which isn't too bad on modern machines. They may require more if your values are longer than around 12 characters each.
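
    If you want to repeat that measurement on your own data, something along these lines should do it (a rough sketch, assuming Devel::Size is installed; the "knnnn"/"vnnnn" pairs just mirror my test):

    use strict;
    use warnings;
    use Devel::Size qw(total_size);

    my (%h1, %h2);
    for my $n (1 .. 300_000) {
        my ($k, $v) = ("k$n", "v$n");
        $h1{$k} = $v;    # forward hash
        $h2{$v} = $k;    # reverse hash
    }

    # total_size() walks the structure, so this counts keys, values and overhead
    printf "%.1f MB\n", ( total_size(\%h1) + total_size(\%h2) ) / 2**20;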

    In this case, you may be able to save a little memory by storing references to the actual data. If a high proportion of the actual data values are say 30+ characters in length, then storing them once and putting references to them in the hashes may achieve some savings, but you'd need to experiment to find the actual breakpoints.

    The saving probably wouldn't be hugely significant unless your data values are really big. It probably wouldn't be useful unless all your lookups in one are derived from the other--i.e. if you need to look up keys input from an external source, you would have lost the benefits of the hash, by needing to first map the actual key to its reference value before being able to look it up! It also makes your algorithms considerably more complicated, having to de-reference all the time.
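
    One way to read that suggestion, as a hedged sketch (all names here are made up): keep each long string in exactly one scalar and key both hashes on the short stringified references. Chained lookups work; looking up a raw external string does not, which is the caveat above.

    use strict;
    use warnings;

    my $str1 = 'a long data string stored only once';
    my $str2 = 'another long data string stored only once';

    my (%fwd, %rev);
    $fwd{ \$str1 } = \$str2;    # the hash key is the stringified ref, e.g. "SCALAR(0x...)"
    $rev{ \$str2 } = \$str1;

    my $ref2 = $fwd{ \$str1 };  # forward lookup: reference to $str2
    print $$ref2, "\n";
    my $ref1 = $rev{ $ref2 };   # chain straight into the reverse hash
    print $$ref1, "\n";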

    Depending on what properties of the hashes you are using, it is possible to create fast(ish) lookup tables that use less storage. You could look at Tie::SubstrHash and/or A (memory) poor man's <strike>hash</strike> lookup table..

    Update: Storing the same two hashes that took 90 MB above as Tie::SubstrHash hashes (keysize=8, valuesize=8, tablesize=300,000) reduces the storage requirement to around 11 MB, at the expense of slower lookups and (the worst part) having to pad your keys and values to the specified maximums.

    The latter part could be alleviated by writing your own tied-hash module (or modifying Tie::SubstrHash) to perform the padding internally. I did this once, but the machine on which I did it is currently dead. It isn't too hard though, and it greatly simplifies the use of the module. Maybe that would be a worthy patch to write?
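
    For reference, a minimal sketch of driving Tie::SubstrHash directly, with the padding done by hand; the 8/8/300,000 figures simply mirror the test above and are not tuned to the OP's data:

    use strict;
    use warnings;
    use Tie::SubstrHash;

    # key length, value length and table size are fixed at tie time
    tie my %compact, 'Tie::SubstrHash', 8, 8, 300_000;

    my ($key, $val) = ('k1', 'v1');
    $compact{ pack 'A8', $key } = pack 'A8', $val;   # space-pad to exactly 8 chars

    my $fetched = $compact{ pack 'A8', $key };
    $fetched =~ s/\s+\z//;                           # strip the padding off again
    print "$fetched\n";                              # prints "v1"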


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: two hashes occupying the same space
by davido (Cardinal) on Oct 28, 2004 at 05:20 UTC

    What you're asking for is a way to gain easy lookups, where the keys of hash1 refer to the same values that are the keys of hash2, and vice versa.

    I can see the logic here. Hash lookups are O(1) in speed efficiency, whereas grepping for a value is O(n). But the problem is that maintaining a pair of hashes that cross-reference each other gains speed at the cost of memory.

    When you find yourself in such a situation, it is often time to start thinking in terms of a relational database. Even if all you have are two columns, it makes sense to let the database engine deal with quickly locating your information for you. Let's say you have the following database table:

    name    | title
    --------|----------------
    bart    | troublemaker
    homer   | delinquent dad
    marge   | enabling mother
    maggie  | wise baby
    lisa    | child prodigy

    You could perform lookups by name by using an SQL query such as this one:

    SELECT title FROM jobs WHERE name='bart'

    Or you could let the database search based on the other column instead...

    SELECT name FROM jobs WHERE title='troublemaker'

    The database engine is usually pretty good at doing lookups in very rapid time. Perl makes database work easy (or at least possible) using the DBI module. DBI is the Perl-side interface to your database. It still needs an actual database too. A very simple, easy to use, easy to install, single-file kind of database will do, and for that, DBD::SQLite is a great tool.
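
    For instance, a rough sketch with DBI and DBD::SQLite (the jobs table and its columns are just the ones from the example above):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=jobs.db', '', '',
                           { RaiseError => 1, AutoCommit => 1 });

    $dbh->do('CREATE TABLE IF NOT EXISTS jobs (name TEXT PRIMARY KEY, title TEXT)');
    $dbh->do('CREATE INDEX IF NOT EXISTS jobs_title ON jobs (title)'); # index the other column too

    my $ins = $dbh->prepare('INSERT OR REPLACE INTO jobs (name, title) VALUES (?, ?)');
    $ins->execute('bart', 'troublemaker');

    # look up in either direction
    my ($title) = $dbh->selectrow_array('SELECT title FROM jobs WHERE name = ?',  undef, 'bart');
    my ($name)  = $dbh->selectrow_array('SELECT name  FROM jobs WHERE title = ?', undef, 'troublemaker');

    print "$name => $title\n";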

    I hope this alternate approach is helpful.


    Dave

      Databases aren't magic. For the database to be fast at both operations, it needs to have indexes on both fields. At which point you have something similar to hashes pointing both ways.

      The only real saving with a relational database is that its internal data format is likely more compact than Perl's, so the data will take less memory.

        But indexing both fields doesn't mean that the indexes have to be held in memory in as memory-consuming a way as holding two hashes in memory. Double-indexed databases do scale well, while paired hashes don't.


        Dave

      Simple suggestion: implement your proposal and then compare the time taken to perform 80 million read/modify/write cycles across 300,000 key/value pairs with doing the same using a Perl hash.
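
      Something along these lines would give a first-order comparison (a scaled-down sketch using the core Benchmark module; it assumes the jobs.db file and jobs table from the DBI example above already exist):

      use strict;
      use warnings;
      use Benchmark qw(cmpthese);
      use DBI;

      my %hash = map { ("k$_" => "v$_") } 1 .. 10_000;

      my $dbh = DBI->connect('dbi:SQLite:dbname=jobs.db', '', '', { RaiseError => 1 });
      my $sth = $dbh->prepare('SELECT title FROM jobs WHERE name = ?');

      cmpthese(-3, {
          perl_hash => sub { my $v = $hash{'k500'} },
          sqlite    => sub {
              $sth->execute('bart');
              my ($v) = $sth->fetchrow_array;
              $sth->finish;
          },
      });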


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

        I don't need to test to know that the database solution will be slower than the hash solution as long as the hash fits in conventional memory, and I think you know the answer to that too. But at some point--300_000 k/v pairs, 500_000, or maybe a million--the hash is going to cease to be a plausible option due to memory constraints. The fact that the OP is asking about trying to get tandem lookup hashes to occupy the same memory space is a pretty good indication that (s)he is running into memory problems.

        One avenue that folks often take when they find that they just can't hold the whole dataset in memory is to turn to databases. Them's the breaks. Tough decisions will have to be made. Either live with the inefficiency of not having the tandem lookup hash capability, or live with the inefficiency of the overhead of a database.

        If holding the dataset in memory once is ok, but twice is impossible, then you can't have a tandem pair of lookup hashes.

        If memory isn't an issue, then the original question is pointless. But there was a point stated in the question: "What I'm trying to do is reduce the memory requirement." In retrospect, I wish I hadn't offered the database suggestion. Not because it's not a reasonable compromise (it is reasonable), but because the OP undoubtedly already knows his options with regard to databases, being a 'Team Sybase Member'.


        Dave

Re: two hashes occupying the same space
by fergal (Chaplain) on Oct 28, 2004 at 10:29 UTC

    If you want to save memory, build your own hash. Here's an example - you should probably replace the _hashit() function with something good. It's a pity Perl doesn't give us access to its own hash function; a little XS trickery might solve that though.

    The point of this example is that it stores both the key and the value as references, so the memory for the strings is shared between the 2 internal hash thingys. It'll obviously be slower than a real Perl hash but should be much faster than a database. It's also missing things like delete, exists and update. These are easy enough to add if you need them.

    I think the memory savings should be pretty good, unless your strings are very short in which case the overhead of the references will undo any benefit.

    A short example of how to use it:

    my $h = TwoWayHash->new(100); # 100 is the "size", ignore for now
    $h->insert("mam", "dad");
    print $h->retrieveOneWay("L2R", "mam"); # Left 2 Right
    # dad
    print $h->retrieveOneWay("R2L", "dad"); # Right 2 Left
    # mam

    If you wanted to make this look like a genuine Perl hash you would have to write another class to handle the tie interface and then tie 2 hashes like this:

    tie %r2l, $twowayhash, "R2L"; # right to left
    tie %l2r, $twowayhash, "L2R"; # left to right
    This will slow it down even further.
    use strict;
    use warnings;

    package TwoWayHash;

    sub new {
        my $pkg  = shift;
        my $size = shift;

        my $self = bless {Size => $size}, $pkg;

        # preextend the buckets
        $self->{R2L} = [];
        $#{$self->{R2L}} = $size - 1;
        $self->{L2R} = [];
        $#{$self->{L2R}} = $size - 1;

        return $self;
    }

    sub insert {
        my $self = shift;
        my ($v1, $v2) = @_;

        $self->insertOneWay("L2R", \$v1, \$v2);
        $self->insertOneWay("R2L", \$v2, \$v1);
    }

    sub insertOneWay {
        my $self  = shift;
        my $which = shift;
        my $key   = shift; # these are refs to the strings
        my $value = shift;

        # turn the key into something much smaller and hopefully unique
        my $h = _hashit($key) % $self->{Size};

        my $buckets = $self->{$which};

        # find the already existing bucket or create a
        # new one then put the key/value into the bucket
        my $bucket = $buckets->[$h] ||= [];
        push(@$bucket, $key, $value);
    }

    sub _hashit {
        # a terrible hash function! It just takes the first 2 characters of the string
        # and turns them into a number between 0 and 65535
        # You'll need something better for real life
        my $rstr = shift;

        return unpack("S", substr($$rstr, 0, 2));
    }

    sub retrieveOneWay {
        my $self  = shift;
        my $which = shift;
        my $key   = shift; # not a ref this time

        my $h = _hashit(\$key) % $self->{Size};

        my $buckets = $self->{$which};

        if (defined(my $list = $buckets->[$h])) {
            # go through the list of pairs of key/values and see if any are the ones
            # we want
            my $i = 0;
            while ($i < $#$list) {
                if ($key eq ${$list->[$i]}) {
                    return ${$list->[$i + 1]};
                }
                $i += 2;
            }
        }

        # if we get here then we didn't find anything nice
        return undef;
    }

    1;
    Some code to test it:
    use strict;
    use warnings;

    use Test::More 'no_plan';
    use Test::NoWarnings; # comment out if you don't have this installed

    use TwoWayHash;

    my $h = TwoWayHash->new(100);

    my @tests = (
        ["abcd123", "efgh123"],
        ["wxyz456", "lmno456"],
        ["abcd456", "efgh456"],
    );

    foreach my $test (@tests) {
        my ($v1, $v2) = @$test;
        $h->insert($v1, $v2);
    }

    use Data::Dumper;
    #print Dumper(\$h);

    foreach my $test (@tests) {
        my ($v1, $v2) = @$test;
        is($h->retrieveOneWay("L2R", $v1), $v2,   "L2R retrieve '$v1'");
        is($h->retrieveOneWay("L2R", $v2), undef, "L2R don't retrieve '$v2'");
        is($h->retrieveOneWay("R2L", $v2), $v1,   "R2L retrieve '$v2'");
        is($h->retrieveOneWay("R2L", $v1), undef, "R2L don't retrieve '$v1'");
    }

    is($h->retrieveOneWay("L2R", "abcd234"), undef, "L2R don't retrieve");

    Creating a new TwoWayHash requires 1 argument, the size. This is the number of buckets to allocate. In a hash, too few buckets means that each bucket will hold many elements, so when it comes time to retrieve an element from a bucket you may need to search through lots of other elements in the same bucket.

    The number of buckets is fixed, and the right size depends on your dataset. Perl's hashes are smarter: they can increase the number of buckets when necessary.

Re: two hashes occupying the same space
by Zaxo (Archbishop) on Oct 28, 2004 at 04:06 UTC

    You can easily combine the two by inverting the first into itself with a hash slice: @hash1{values %hash1} = keys %hash1; Be aware that you will lose some keys if there are duplicate values.
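
    A quick, made-up demonstration of both the inversion and the duplicate-value caveat:

    use strict;
    use warnings;

    my %hash1 = ( red => 'stop', green => 'go', amber => 'stop' );

    @hash1{ values %hash1 } = keys %hash1;

    # %hash1 now holds both directions in one hash, but only one of
    # 'red'/'amber' survives as the value for the duplicate 'stop' key
    # (which one depends on hash order).
    print "$_ => $hash1{$_}\n" for sort keys %hash1;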

    After Compline,
    Zaxo

      I think the OP is trying to have $value1 only once in memory, and $value2 only once in memory, in order to reduce memory usage. Sharing the same hash doesn't reduce memory usage, and constructing it in that manner is not very memory efficient.

      I don't believe that's what he wants; that sounds too easy to me.

      He obviously needs both hashes, to search in both directions, and he cannot lose the original hash. Unless, that is, the original hash is only useful for the first phase of the execution and the reversed one is only needed for the second phase.