johnkj has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to load a hash table with 150 million key/value pairs. My key is 12 bytes long and my value is 15 bytes. I have tried using DB_File, but I can't seem to get it to load everything; it always balks after around 111 million. Any ideas?
if( $a ne " " ) {
    if( !exists $Hash{$a} ) {
        $Hash{$a} = $b;
    } else {
        $is_there     = 0;
        $tmp_is_there = 0;
        $temp = $Hash{$a};
        foreach my $elt ( @{$Hash{$a}} ) {
            if ($elt eq $b)    { $is_there     = 1; }
            if ($elt eq $temp) { $tmp_is_there = 1; }
        }
        if ( !$is_there )     { push @{$Hash{$a}}, $b; }
        if ( !$tmp_is_there ) { push @{$Hash{$a}}, $temp; }
    }
}

Replies are listed 'Best First'.
Re (tilly) 1: millions of records in a Hash
by tilly (Archbishop) on Feb 24, 2002 at 18:09 UTC
    First let me note that the code you have given cannot be for your stated problem. For one thing, it contains a syntax error. For another, you are using the hash values in a DBM as if they were anonymous arrays; that generally won't work due to stringification.

    However, assuming that your problem as stated is indeed real, I would first check whether the file DB_File is working with has exceeded the limitations of your operating system and/or file system for large files. That is likely if the file is around 2 GB. If so, then try upgrading your operating system. Current versions of most operating systems should handle files of hundreds of terabytes.

    Based on the numbers you have given though, I suspect that you have hit a different limit. If your machine is 32-bit (if it is on Intel hardware then it almost assuredly is) then I would wonder whether somewhere within DB_File it keeps track of a pointer itself, and that limits the size of file it can address. (Berkeley DB itself has no such size limit so it would be in the interface.) My first shot would be to try the newer BerkeleyDB and then report the bug to Paul Marquess, with a short program that produces the bad data set on your system, along with my guess as to the problem. (If there is a difference in behaviour, be sure to tell him that as well.) Only do this if you are not hitting a limit at 2 GB. He knows all about the 2 GB limit, it isn't his bug, and the only thing he can tell you is to upgrade.
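    For reference, the newer BerkeleyDB module ties a hash in much the same way as DB_File. A minimal sketch, with a hypothetical file name data.db standing in for your real data file:

    use strict;
    use warnings;
    use BerkeleyDB;

    # Tie a hash to an on-disk Berkeley DB hash file.
    tie my %hash, 'BerkeleyDB::Hash',
        -Filename => 'data.db',
        -Flags    => DB_CREATE
        or die "Cannot open data.db: $BerkeleyDB::Error\n";

    $hash{'somekey'} = 'somevalue';   # used like an ordinary hash
    untie %hash;

    If that version also stops at the same record count, that difference (or lack of one) is worth mentioning in the bug report.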

    If you are hitting large file limits, you can still get around them but it will be slower. What you need to do is sit down with perltie and figure out how to write your own tied interface. For instance you could have 4 dbms on disk, and use ord($key) & 3 to figure out which one a given key/value pair was going into. Now since each one is only getting 1/4 of the data, none of them will hit the size issues.
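    As an illustration of that partitioning idea, here is a minimal, untested sketch of such a tied interface. The package name, the file-naming scheme, and the omission of FIRSTKEY/NEXTKEY/CLEAR (so keys and each won't work) are my own simplifications, not anything from the original post:

    package PartitionedDBM;

    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    # Open four DB_File files, one per partition.
    sub TIEHASH {
        my ($class, $basename) = @_;
        my @parts;
        for my $i (0 .. 3) {
            my %h;
            tie %h, 'DB_File', "$basename.$i", O_RDWR|O_CREAT, 0666
                or die "Can't tie $basename.$i: $!";
            $parts[$i] = \%h;
        }
        return bless { parts => \@parts }, $class;
    }

    # Pick a partition from the first byte of the key, as described above.
    sub _part {
        my ($self, $key) = @_;
        return $self->{parts}[ ord($key) & 3 ];
    }

    sub FETCH  { my ($self, $key) = @_;       $self->_part($key)->{$key} }
    sub STORE  { my ($self, $key, $val) = @_; $self->_part($key)->{$key} = $val }
    sub EXISTS { my ($self, $key) = @_;       exists $self->_part($key)->{$key} }
    sub DELETE { my ($self, $key) = @_;       delete $self->_part($key)->{$key} }

    1;

    You would then say tie my %big, 'PartitionedDBM', '/some/path/bigdata'; and use %big like an ordinary (if slower) hash for store, fetch, exists and delete.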

Re: millions of records in a Hash
by Speedy (Monk) on Feb 24, 2002 at 17:25 UTC

    If I understand what you are attempting, you want to check the existing values whenever you go to enter a new key $a and discover that the key already exists in %Hash. Right now you are storing the values for each hash key in an array, then running through that long array.

    Would it work to create, in advance, an "inverted" hash stored under another variable name, like %Hash_values, and save it in an appropriate file? You don't seem to care which key goes with a value; you just want a quick way to see whether the value exists. You could, for example, use Recipe 5.8 in the Perl Cookbook to create %Hash_values initially, which has as its KEYS the values of %Hash, and then tie BOTH %Hash and %Hash_values when doing the checks.

    The code could look like:

    use DB_File;
    use Fcntl;

    tie %Hash, 'DB_File', $path_to_hash, O_RDWR|O_CREAT, 0666
        or die "Can't tie $path_to_hash: $!";
    tie %Hash_values, 'DB_File', $path_to_hash_values, O_RDWR|O_CREAT, 0666
        or die "Can't tie $path_to_hash_values: $!";

    while (($key, $value) = each %Hash) {
        $Hash_values{$value} = $key;   # Or any value for $key
    }

    untie %Hash;
    untie %Hash_values;

    Then, rather than doing a long foreach loop through an array, you would simply ask: if (exists $Hash_values{$b}) { ... do stuff ... }

    If you find $b does not exist in the list of values (which are now the KEYS of %Hash_values), you can simply add it to %Hash_values to keep the running list of values current (perhaps with $Hash_values{$b} = 1; since you don't care what the "value" of the value is, you just need a quick way to look the values up).
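    Putting those two pieces together, the check-and-record step might look something like this sketch (variable names taken from the original post):

    # See whether the value $b has been seen before; record it if not.
    unless (exists $Hash_values{$b}) {
        $Hash_values{$b} = 1;   # the stored value is irrelevant
        # ... handle a value not seen before ...
    }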

Re: millions of records in a Hash
by jeffenstein (Hermit) on Feb 24, 2002 at 18:18 UTC

    You mention trying DB_File to store the hash. Did you use your posted code with DB_File?

    If you used your posted code, none of the data would have actually ended up in the database. What you would want to do is to use split/join (or Storable for that matter) to turn your array into a string, and store that in the database, something like this:

    my $data = $Hash{$a};
    my @elt  = split /::/, $data;

    # Do stuff with @elt here.
    # push @elt, $new_data;   # to add data

    $data = join '::', @elt;
    $Hash{$a} = $data;

    This way, your database won't just hold a stringified reference to data in memory; your data will actually reside on disk.
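    If you went the Storable route instead, a minimal sketch might look like the following, using freeze/thaw to serialize the array into the string that DB_File expects ($path, $a and $b stand in for the variables from the original post, and it assumes every value in the file is stored frozen):

    use DB_File;
    use Fcntl;
    use Storable qw(freeze thaw);

    tie my %Hash, 'DB_File', $path, O_RDWR|O_CREAT, 0666
        or die "Can't tie $path: $!";

    # Deserialize the stored array (if any), add the new value, serialize it back.
    my @elt = exists $Hash{$a} ? @{ thaw($Hash{$a}) } : ();
    push @elt, $b unless grep { $_ eq $b } @elt;
    $Hash{$a} = freeze(\@elt);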

Re: millions of records in a Hash
by Buggs (Acolyte) on Feb 24, 2002 at 18:03 UTC
    If you don't need the whole hash in memory at the same time
    (and I assume you don't),
    then let the database do the work.
    Like this pseudo code:
    SELECT FROM foo WHERE (bar=$a);
    found a dataset? UPDATE else CREATE
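    In real code that pseudo code might become something like the following DBI sketch; the DSN, credentials, and the table and column names (foo, bar, baz) are placeholders, not anything from the original post:

    use DBI;

    # Connect; DSN and credentials are placeholders.
    my $dbh = DBI->connect('dbi:Oracle:mydb', $db_user, $db_pass,
                           { RaiseError => 1, AutoCommit => 1 });

    my $sel = $dbh->prepare('SELECT bar FROM foo WHERE bar = ?');
    my $upd = $dbh->prepare('UPDATE foo SET baz = ? WHERE bar = ?');
    my $ins = $dbh->prepare('INSERT INTO foo (bar, baz) VALUES (?, ?)');

    $sel->execute($a);
    if ( $sel->fetchrow_arrayref ) {
        $upd->execute($b, $a);   # found a dataset: UPDATE
    } else {
        $ins->execute($a, $b);   # not found: CREATE
    }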
Re: millions of records in a Hash
by joealba (Hermit) on Feb 25, 2002 at 02:43 UTC
    Okay.. I'll bite.. Why on Earth would you load a hash with 150 million key/value pairs!? Software packages like Oracle and MySQL were developed for one major reason: Storing large amounts of data for optimal retrieval requires lots of code tricks.

    I'll assume you have a good reason and move on. :) The size of a hash is not infinite -- as you've found, a hash becomes unusable after a certain size. I know there's a module out there called something like *::BigHash, but I can't find anything on it. Am I imagining things, or do any other monks know about this module?

    Update: Cool! I'm learning about the specifics of DB_File now because of your question. Thanks! Anyway, maybe you need to use a different hashing algorithm? Can you tell if your program is bound by CPU cycles, disk I/O, etc. when it begins to fail? Have you tried toying around with cache sizes on your DB_HASH? What about different key sizes (if your data can sensibly be stored as a BTREE)? Can you give us a little more info about the data so we can help you out? And what kind of computer are you using that can handle all this data!?
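    For what it's worth, DB_File lets you tune the cache and pick the underlying structure when you tie the hash. A minimal sketch (the 64 MB figure and $path are just illustrative):

    use DB_File;
    use Fcntl;

    # Enlarge Berkeley DB's in-memory cache before tying the hash.
    $DB_HASH->{'cachesize'} = 64 * 1024 * 1024;
    tie my %Hash, 'DB_File', $path, O_RDWR|O_CREAT, 0666, $DB_HASH
        or die "Can't tie $path: $!";

    # Alternatively, store the data as a BTREE instead of a hash:
    # $DB_BTREE->{'cachesize'} = 64 * 1024 * 1024;
    # tie my %Hash, 'DB_File', $path, O_RDWR|O_CREAT, 0666, $DB_BTREE
    #     or die "Can't tie $path: $!";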

      You appear to have a common misunderstanding about the reason that relational databases exist.

      The key win of relational databases is that they allow people to store, manage, manipulate and flexibly query data structures without having to think in detail about algorithms. If I was managing a few hundred records and needed to do things like easily find what classes Johnny took, I would be inclined to use a relational database for those features. And if I had a good one, it would grow with me if I needed to handle millions of records without my having to get into some heavy-duty wizardry.

      However the problem of efficient storage and access to data is independent of the data management structure of a relational database. That is the role of a dbm. The technologies in dbms are buried somewhere inside of traditional relational databases. (MySQL can actually run on top of Berkeley DB.) But sometimes your structure is simple enough that you can manage it yourself, and in that case there is no reason not to remove the relational layer. Large amounts of data is not the same as complex data.

      (There are also other management strategies than relational, for instance object oriented databases like Gemstone.)

        You are surely right that DBMs, like BDB, seem to fit the problem well.
        On the other hand, maybe the seeker needs multi-user access or wants to connect to his data remotely.
        We don't know that.
        After all, RDBMSs provide many more services, usually fit smaller problems too, and often avoid scaling problems as well.
        So the general advice to use an RDBMS where a DBM would suffice is not always based on a misunderstanding; perhaps more often it is based on the simple fact that people deal more with RDBMSs than with DBMs.
        People are running multi-user operating systems for mere desktop usage, after all, and many are happy with that even though a single-user system would suffice.

        Another approach that is often forgotten is to write your own storage method, which, given the seeker's description, doesn't seem out of the question and could well result in the most performant solution.
        Good point, tilly. The relational aspects of Oracle and MySQL may not help (depending on the data). But I assume that database packages like these use more efficient storage than DB_File and so may work better for such large data sets. Am I wrong?
        Thanks, tilly, for your advice. I had been trying to load a simple %hash variable with the key/value pairs, and I have to take care of the duplicates. The key/value pairs exist in an Oracle DB, and it's not indexed on the key. Would hitting the DB using the DBI module be more efficient than trying to load up a %hash? I am using a monster DEC Alpha box with at least 3 GB of RAM.
      Hi joealba, I have a 12-byte key and a 15-byte value that I am trying to store in a %hash variable. The key is alphanumeric, the value is numeric. I think I might have hit some kind of memory limit. That is hard to imagine, though: I am using a box with a 16 GB virtual memory size and 3 GB of RAM, so I shouldn't have had any issues, but it looks like I do. It's a DEC Alpha box. I've been busy with other issues and am now back to tackling this one. Any wisdom will be gladly accepted. Thanks in advance.