johnkj has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to load a hash table with 150 million key/value pairs. My key is 12 bytes long and my value is 15 bytes. I have tried using DB_File, but I can't seem to get it to load everything; it always balks after around 111 million. Any ideas?
if( $a ne " " ) {
    if( !exists $Hash{$a} ) {
        $Hash{$a} = $b;
    } else {
        $is_there     = 0;
        $tmp_is_there = 0;
        $temp = $Hash{$a};
        foreach my $elt ( @{$Hash{$a}} ) {
            if ($elt eq $b)    { $is_there     = 1; }
            if ($elt eq $temp) { $tmp_is_there = 1; }
        }
        if ( !$is_there )     { push @{$Hash{$a}}, $b; }
        if ( !$tmp_is_there ) { push @{$Hash{$a}}, $temp; }
    }
}

Replies are listed 'Best First'.
Re (tilly) 1: millions of records in a Hash
by tilly (Archbishop) on Feb 24, 2002 at 18:09 UTC
    First let me note that the code you have given cannot be for your stated problem. For one thing, it contains a syntax error. For another, you are using the hash values in a DBM as if they were anonymous arrays; that generally won't work due to stringification.

    However, assuming that your problem as stated is indeed real, I would first check whether the file DB_File is working with has exceeded the limitations of your operating system and/or file system for large files. That is likely if the file is around 2 GB. If so, then try upgrading your operating system. Current versions of most operating systems should handle files of hundreds of terabytes.

    Based on the numbers you have given though, I suspect that you have hit a different limit. If your machine is 32-bit (if it is on Intel hardware then it almost assuredly is) then I would wonder whether somewhere within DB_File it keeps track of a pointer itself, and that limits the size of file it can address. (Berkeley DB itself has no such size limit so it would be in the interface.) My first shot would be to try the newer BerkeleyDB and then report the bug to Paul Marquess, with a short program that produces the bad data set on your system, along with my guess as to the problem. (If there is a difference in behaviour, be sure to tell him that as well.) Only do this if you are not hitting a limit at 2 GB. He knows all about the 2 GB limit, it isn't his bug, and the only thing he can tell you is to upgrade.
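    For reference, the newer BerkeleyDB module ties a hash in much the same way as DB_File. A minimal sketch, with a hypothetical file name data.db standing in for your real data file:

    use strict;
    use warnings;
    use BerkeleyDB;

    # Tie a hash to an on-disk Berkeley DB hash file.
    tie my %hash, 'BerkeleyDB::Hash',
        -Filename => 'data.db',
        -Flags    => DB_CREATE
        or die "Cannot open data.db: $BerkeleyDB::Error\n";

    $hash{'somekey'} = 'somevalue';   # used like an ordinary hash
    untie %hash;

    If that version also stops at the same record count, that difference (or lack of one) is worth mentioning in the bug report.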

    If you are hitting large file limits, you can still get around them but it will be slower. What you need to do is sit down with perltie and figure out how to write your own tied interface. For instance you could have 4 dbms on disk, and use ord($key) & 3 to figure out which one a given key/value pair was going into. Now since each one is only getting 1/4 of the data, none of them will hit the size issues.
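    As an illustration of that partitioning idea, here is a minimal, untested sketch of such a tied interface. The package name, the file-naming scheme, and the omission of FIRSTKEY/NEXTKEY/CLEAR (so keys and each won't work) are my own simplifications, not anything from the original post:

    package PartitionedDBM;

    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    # Open four DB_File files, one per partition.
    sub TIEHASH {
        my ($class, $basename) = @_;
        my @parts;
        for my $i (0 .. 3) {
            my %h;
            tie %h, 'DB_File', "$basename.$i", O_RDWR|O_CREAT, 0666
                or die "Can't tie $basename.$i: $!";
            $parts[$i] = \%h;
        }
        return bless { parts => \@parts }, $class;
    }

    # Pick a partition from the first byte of the key, as described above.
    sub _part {
        my ($self, $key) = @_;
        return $self->{parts}[ ord($key) & 3 ];
    }

    sub FETCH  { my ($self, $key) = @_;       $self->_part($key)->{$key} }
    sub STORE  { my ($self, $key, $val) = @_; $self->_part($key)->{$key} = $val }
    sub EXISTS { my ($self, $key) = @_;       exists $self->_part($key)->{$key} }
    sub DELETE { my ($self, $key) = @_;       delete $self->_part($key)->{$key} }

    1;

    You would then say tie my %big, 'PartitionedDBM', '/some/path/bigdata'; and use %big like an ordinary (if slower) hash for store, fetch, exists and delete.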

Re: millions of records in a Hash
by Speedy (Monk) on Feb 24, 2002 at 17:25 UTC

    If I understand what you are attempting, you want to check the existing values whenever you go to enter a new key $a and discover that the key already exists in %Hash. Right now you are storing the values for each hash key in an array, then running through that long array.

    Would it work to create, in advance, an "inverted" hash stored under another variable name, like %Hash_values, and save it in an appropriate file? You don't seem to care which key goes with a value; you just want a quick way to see whether the value exists. You could, for example, use Recipe 5.8 in the Perl Cookbook to create %Hash_values initially, which has as its KEYS the values of %Hash, and then tie BOTH %Hash and %Hash_values when doing the checks.

    The code could look like:

    use DB_File;
    use Fcntl;

    tie %Hash, 'DB_File', $path_to_hash, O_RDWR|O_CREAT, 0666
        or die "Can't tie $path_to_hash: $!";
    tie %Hash_values, 'DB_File', $path_to_hash_values, O_RDWR|O_CREAT, 0666
        or die "Can't tie $path_to_hash_values: $!";

    while (($key, $value) = each %Hash) {
        $Hash_values{$value} = $key;   # Or any value for $key
    }

    untie %Hash;
    untie %Hash_values;

    Then, rather than doing a long foreach loop through an array, you would simply ask: if (exists $Hash_values{$b}) { ... do stuff ... }

    If you find $b does not exist in the list of values (which are now the KEYS of %Hash_values), you can simply add it to %Hash_values to keep the running list of values current (perhaps with $Hash_values{$b} = 1; since you don't care what the "value" of the value is, you just need a quick way to look the values up).
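    Putting those two pieces together, the check-and-record step might look something like this sketch (variable names taken from the original post):

    # See whether the value $b has been seen before; record it if not.
    unless (exists $Hash_values{$b}) {
        $Hash_values{$b} = 1;   # the stored value is irrelevant
        # ... handle a value not seen before ...
    }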

Re: millions of records in a Hash
by jeffenstein (Hermit) on Feb 24, 2002 at 18:18 UTC

    You mention trying DB_File to store the hash. Did you use your posted code with DB_File?

    If you used your posted code, none of the data would have actually ended up in the database. What you would want to do is to use split/join (or Storable for that matter) to turn your array into a string, and store that in the database, something like this:

    my $data = $Hash{$a};
    my @elt  = split /::/, $data;

    # Do stuff with @elt here.
    # push @elt, $new_data;   # to add data

    $data = join '::', @elt;
    $Hash{$a} = $data;

    This way, your database won't just hold a stringified reference to data in memory; your data will actually reside on disk.
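    If you went the Storable route instead, a minimal sketch might look like the following, using freeze/thaw to serialize the array into the string that DB_File expects ($path, $a and $b stand in for the variables from the original post, and it assumes every value in the file is stored frozen):

    use DB_File;
    use Fcntl;
    use Storable qw(freeze thaw);

    tie my %Hash, 'DB_File', $path, O_RDWR|O_CREAT, 0666
        or die "Can't tie $path: $!";

    # Deserialize the stored array (if any), add the new value, serialize it back.
    my @elt = exists $Hash{$a} ? @{ thaw($Hash{$a}) } : ();
    push @elt, $b unless grep { $_ eq $b } @elt;
    $Hash{$a} = freeze(\@elt);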

Re: millions of records in a Hash
by Buggs (Acolyte) on Feb 24, 2002 at 18:03 UTC
    If you don't need the whole hash in memory at the same time
    (and I assume you don't),
    then let the database do the work.
    Like this pseudo code:
    SELECT FROM foo WHERE (bar=$a);
    found a dataset? UPDATE else CREATE
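    In real code that pseudo code might become something like the following DBI sketch; the DSN, credentials, and the table and column names (foo, bar, baz) are placeholders, not anything from the original post:

    use DBI;

    # Connect; DSN and credentials are placeholders.
    my $dbh = DBI->connect('dbi:Oracle:mydb', $db_user, $db_pass,
                           { RaiseError => 1, AutoCommit => 1 });

    my $sel = $dbh->prepare('SELECT bar FROM foo WHERE bar = ?');
    my $upd = $dbh->prepare('UPDATE foo SET baz = ? WHERE bar = ?');
    my $ins = $dbh->prepare('INSERT INTO foo (bar, baz) VALUES (?, ?)');

    $sel->execute($a);
    if ( $sel->fetchrow_arrayref ) {
        $upd->execute($b, $a);   # found a dataset: UPDATE
    } else {
        $ins->execute($a, $b);   # not found: CREATE
    }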
Re: millions of records in a Hash
by joealba (Hermit) on Feb 25, 2002 at 02:43 UTC
    Okay.. I'll bite.. Why on Earth would you load a hash with 150 million key/value pairs!? Software packages like Oracle and MySQL were developed for one major reason: Storing large amounts of data for optimal retrieval requires lots of code tricks.

    I'll assume you have a good reason and move on. :) The size of a hash is not infinite -- as you've found, a hash becomes unusable after a certain size. I know there's a module out there called something like *::BigHash, but I can't find anything on it. Am I imagining things, or do any other monks know about this module?

    Update: Cool! I'm learning about the specifics of DB_File now because of your question. Thanks! Anyway, maybe you need to use a different hashing algorithm? Can you tell if your program is bound by CPU cycles, disk I/O, etc. when it begins to fail? Have you tried toying around with cache sizes on your DB_HASH? What about different key sizes (if your data can sensibly be stored as a BTREE)? Can you give us a little more info about the data so we can help you out? And what kind of computer are you using that can handle all this data!?
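    For what it's worth, DB_File lets you tune the cache and pick the underlying structure when you tie the hash. A minimal sketch (the 64 MB figure and $path are just illustrative):

    use DB_File;
    use Fcntl;

    # Enlarge Berkeley DB's in-memory cache before tying the hash.
    $DB_HASH->{'cachesize'} = 64 * 1024 * 1024;
    tie my %Hash, 'DB_File', $path, O_RDWR|O_CREAT, 0666, $DB_HASH
        or die "Can't tie $path: $!";

    # Alternatively, store the data as a BTREE instead of a hash:
    # $DB_BTREE->{'cachesize'} = 64 * 1024 * 1024;
    # tie my %Hash, 'DB_File', $path, O_RDWR|O_CREAT, 0666, $DB_BTREE
    #     or die "Can't tie $path: $!";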

      You appear to have a common misunderstanding about the reason that relational databases exist.

      The key win of relational databases is that they allow people to store, manage, manipulate and flexibly query data structures without having to think in detail about algorithms. If I was managing a few hundred records and needed to do things like easily find what classes Johnny took, I would be inclined to use a relational database for those features. And if I had a good one, it would grow with me if I needed to handle millions of records without my having to get into some heavy-duty wizardry.

      However the problem of efficient storage and access to data is independent of the data management structure of a relational database. That is the role of a dbm. The technologies in dbms are buried somewhere inside of traditional relational databases. (MySQL can actually run on top of Berkeley DB.) But sometimes your structure is simple enough that you can manage it yourself, and in that case there is no reason not to remove the relational layer. Large amounts of data is not the same as complex data.

      (There are also other management strategies than relational, for instance object oriented databases like Gemstone.)

        You are surely right that DBMs, like BDB, seem to fit the problem well.
        On the other hand, maybe the seeker needs multi-user access or wants to connect to his data remotely.
        We don't know that.
        After all, RDBMSs provide many more services, usually fit smaller problems too, and often avoid scaling problems as well.
        So the general advice to use an RDBMS where a DBM would suffice is not always based on a misunderstanding; perhaps more often it is based on the simple fact that people deal more with RDBMSs than with DBMs.
        People are running multi-user operating systems for mere desktop usage, after all, and many are happy with that even though a single-user system would suffice.

        Another approach that is often forgotten is to write your own storage method, which, given the seeker's description, doesn't seem out of the question and could well result in the most performant solution.
        Good point, tilly. The relational aspects of Oracle and MySQL may not help (depending on the data). But I assume that database packages like these use more efficient storage than DB_File and so may work better for such large data sets. Am I wrong?
        Thanks, tilly, for your advice. I had been trying to load a simple %hash variable with the key/value pairs, and I have to take care of the duplicates. The key/value pairs exist in an Oracle DB, and it's not indexed on the key. Would hitting the DB using the DBI module be more efficient than trying to load up a %hash? I am using a monster DEC Alpha box with at least 3 GB of RAM.
      Hi joealba, I have a 12-byte key and a 15-byte value that I am trying to store in a %hash variable. The key is alphanumeric, the value is numeric. I think I might have hit some kind of memory limit. That is hard to imagine, though: I am using a box with a 16 GB virtual memory size and 3 GB of RAM, so I shouldn't have had any issues, but it looks like I do. It's a DEC Alpha box. I've been busy with other issues and am now back to tackling this one. Any wisdom will be gladly accepted. Thanks in advance.