hbm has asked for the wisdom of the Perl Monks concerning the following question:

I have a hash of about 40M keys and undef'd values. It has outgrown RAM, so I've tied it to a file. Now I'm finding that each returns some keys for which exists is false. How can that be?!

    use strict;
    use warnings;
    use DB_File;

    tie (my %H, "DB_File", "audit.db", O_RDWR|O_CREAT, 0666) or die $!;

    my $doi = '10.1353/ham.2005.0020';
    print "|$doi| does ", (exists $H{$doi} ? "exist!\n" : "not exist!\n");

    while(my $k = each %H) {
        if ($k eq $doi) {
            print "|$k| eq |$doi|\n";
            last; # spare me the other millions...
        }
    }

    untie %H;

Output:

    |10.1353/ham.2005.0020| does not exist!
    |10.1353/ham.2005.0020| eq |10.1353/ham.2005.0020|

Update: I should mention, this is Perl 5.8.3 and DB_File 1.808.

Update: I changed my loop to the following, and killed it after 30 minutes with over 9M keys not existing.

    while(my $k = each %H) {
        print "$k does not exist\n" if (!exists $H{$k});
    }

Update: Fixed with DB_File 1.82!

Re: with tied hash, 'each' gives key that doesn't 'exists'
by JavaFan (Canon) on Sep 15, 2010 at 21:38 UTC
    I should mention, this is Perl 5.8.3 and DB_File 1.808
    5.8.3 was released in January 2004. Perl is now up to 5.12.2, which comes with DB_File 1.818 (1.820 is out on CPAN). Could you check if the issue still occurs with a more recent Perl/DB_File? Make sure you've installed the newest version of Berkeley DB itself as well. It might very well be that you've stumbled upon a bug that has been fixed sometime in the past 6 years.
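For anyone comparing setups, a minimal sketch of how the three version numbers in question might be printed. It assumes DB_File is installed; $DB_File::VERSION is the module version and $DB_File::db_version is the version of the Berkeley DB library the module was built against.

```perl
use strict;
use warnings;
use DB_File;

# $] is the running Perl's version in numeric form, e.g. 5.008003 for 5.8.3.
print "Perl:        $]\n";

# Version of the DB_File module itself (1.808 in the original report).
print "DB_File:     $DB_File::VERSION\n";

# Version of the underlying Berkeley DB library DB_File was compiled against.
print "Berkeley DB: $DB_File::db_version\n";
```

Running this under both the old and the new installation shows whether the module, the library, or both actually changed.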

      I'm trying DB_File 1.82 now. I'm building the database anew, which will take a couple more hours to complete.

      Do you know whether DB_File 1.82 should have been able to open a db that was created with 1.808? I assume not, as I got 'file exists' errors, even with O_RDWR|O_CREAT.

      Thanks!

      Update: It's quite clear, I can only open a db with the version of DB_File that created the db. (And that's fine.)

Re: with tied hash, 'each' gives key that doesn't 'exists'
by ikegami (Patriarch) on Sep 15, 2010 at 21:18 UTC
    exists and each are handled by separate tied methods. There could be a bug in DB_File.
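To illustrate the point about separate tied methods, here is a small sketch with a made-up tie class. BrokenHash is hypothetical and deliberately wrong; it says nothing about how DB_File is implemented. It only shows that each goes through FIRSTKEY/NEXTKEY while exists goes through EXISTS, so the two can disagree if either has a bug.

```perl
use strict;
use warnings;

# Hypothetical tie class, for illustration only. NOT how DB_File works.
package BrokenHash;

sub TIEHASH {
    my ($class, @keys) = @_;
    return bless { keys => [@keys], pos => 0 }, $class;
}

# each() and keys() are served by these two methods ...
sub FIRSTKEY { my $self = shift; $self->{pos} = 0; return $self->NEXTKEY }
sub NEXTKEY  { my $self = shift; return $self->{keys}[ $self->{pos}++ ] }

# ... while exists() is served by this one, which is deliberately buggy
# and denies every key.
sub EXISTS { return 0 }

sub FETCH { return undef }

package main;

tie my %h, 'BrokenHash', qw(a b c);

# Collect every key that each() hands back but exists() rejects.
my @missing;
while (defined(my $k = each %h)) {
    push @missing, $k unless exists $h{$k};
}
print "each returned but exists denied: @missing\n";
```

With this class every key reported by each fails the exists test, which is the same symptom as in the original post, just produced on purpose.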
Re: with tied hash, 'each' gives key that doesn't 'exists'
by Khen1950fx (Canon) on Sep 15, 2010 at 19:43 UTC
    I rearranged things a little...
    #!/usr/bin/perl
    use strict;
    use warnings;
    use DB_File;

    my $a = new DB_File::HASHINFO;
    my $doi = '10.1353/ham.2005.0020';

    tie my %H, "DB_File", "audit.db", O_RDWR|O_CREAT, 0666, $a;

    print "|$doi| does ", (exists $H{$doi} ? "exist!\n" : "not exist!\n");

    while(my $k = each %H) {
        if ($k eq $doi) {
            print "|$k| eq |$doi|\n";
            last; # spare me the other millions...
        }
    }

    untie %H;

      Thanks, but that had no effect.

Re: with tied hash, 'each' gives key that doesn't 'exists'
by furry_marmot (Pilgrim) on Sep 16, 2010 at 16:22 UTC

    Since you're not using the value, have you considered using keys instead of each? Maybe there's a bug in the DB_File implementation of each that is not in the keys implementation.

    Also, have you considered using Data::Dumper or YAML to dump the hash to text and peruse it in your favorite editor? Sometimes patterns emerge just from getting a look at the data.

    --marmot

      FYI, keys would be very inappropriate in this case, because it would read the file twice: once to build a complete list of the record keys in RAM (virtual memory), and again to retrieve the records by those keys. Most of the time you would run out of RAM. But even if you didn't, you'd be doing a massive amount of unnecessary work.

      each, on the other hand, simply walks through the file. There is no memory footprint to speak of. Furthermore, it always does so in ascending order by key. Very often, that is exactly what you want. (What bearing, if any, this may have on this issue, I don't know.)

        It is not generally true that each returns keys (or key/value pairs) in ascending order by key. It is true for some base classes that one can tie a hash to, but not for DB_File which is what the OP is asking about. See the description of each in Programming Perl 3rd Edition p. 703.
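If sorted traversal is actually wanted, one option is DB_File's BTREE format, which keeps keys in lexical order by default. A minimal sketch follows, assuming DB_File is installed; the DOI-like keys and the temporary file are made up for the example. The OP's tie uses the default hash format, where no such ordering should be relied on.

```perl
use strict;
use warnings;
use DB_File;
use Fcntl;
use File::Temp qw(tempdir);

# Scratch directory that is removed when the script exits.
my $dir = tempdir(CLEANUP => 1);

# Made-up keys, inserted in a deliberately unsorted order.
my @dois = map { "10.1000/example.$_" } (5, 3, 9, 1, 7);

# Passing $DB_BTREE as the last argument selects the BTREE format.
tie my %btree, 'DB_File', "$dir/btree.db", O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot tie: $!";
$btree{$_} = '' for @dois;

# Walk the file with each and record the order the keys come back in.
my @order;
while (defined(my $k = each %btree)) {
    push @order, $k;
}
untie %btree;

print "$_\n" for @order;
```

The keys come back in lexical order regardless of insertion order. Whether a BTREE file is a good fit for 40M keys is a separate question that this sketch does not address.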

      I run out of memory with keys, hence each.

      I'll consider your other suggestions, but first I'm testing the latest DB_File.

      Thanks!