Re (tilly) 1: millions of records in a Hash
by tilly (Archbishop) on Feb 24, 2002 at 18:09 UTC
First let me note that the code that you have given
cannot be for your stated problem. For one thing, it contains
a syntax error. For another, you are using the hash values
in a DBM as if they were anonymous arrays. That generally
won't work, because the references get stringified when they are stored.
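To illustrate the stringification problem with a tiny, made-up example (the file name demo.db is arbitrary):
use strict;
use DB_File;
use Fcntl qw(O_RDWR O_CREAT);

tie my %Hash, 'DB_File', 'demo.db', O_RDWR|O_CREAT, 0666
    or die "Can't tie demo.db: $!";
$Hash{key} = [1, 2, 3];     # the array reference is stringified on STORE
print $Hash{key}, "\n";     # prints something like ARRAY(0x8192fd0) -- the list is gone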
However, assuming that your problem as stated is real,
I would first check whether the file DB_File is working
with has exceeded the large-file limitations of your
operating system and/or file system. That would generally
be the case if the file is around 2 GB. If so, try
upgrading your operating system; current versions of most
operating systems should handle files of hundreds of
terabytes.
Based on the numbers you have given, though, I suspect that
you have hit a different limit. If your machine is 32-bit
(if it is on Intel hardware then it almost assuredly is)
then I would wonder whether somewhere within DB_File the
file offset is kept in a 32-bit value, which would limit the
size of file it can address. (Berkeley DB itself has no such
size limit, so it would be in the interface.) My first
shot would be to try the newer BerkeleyDB module and then
report the bug to Paul Marquess, with a short program that
produces the bad data set on your system, along with my
guess as to the problem. (If there is a difference in
behaviour, be sure to tell him that as well.) Only
do this if you are not hitting a limit at 2 GB. He knows
all about the 2 GB limit; it isn't his bug, and the only
thing he can tell you is to upgrade.
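For reference, trying the newer BerkeleyDB module could look roughly like this (the file name is just a placeholder):
use strict;
use BerkeleyDB;

tie my %Hash, 'BerkeleyDB::Hash',
    -Filename => 'mydata.db',
    -Flags    => DB_CREATE
    or die "Can't tie mydata.db: $BerkeleyDB::Error";

$Hash{somekey} = 'somevalue';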
If you are hitting large-file limits, you can still get
around them, but it will be slower. What you need to do is
sit down with perltie and figure out how to write your
own tied interface. For instance you could have 4 DBMs on
disk, and use ord($key) & 3 to figure out which
one a given key/value pair goes into. Since each
one only gets 1/4 of the data, none of them will hit
the size issues.
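A minimal sketch of that idea (the ShardedDBM package name and file names are made up, and FIRSTKEY/NEXTKEY for iterating are left out):
package ShardedDBM;
use strict;
use DB_File;
use Fcntl qw(O_RDWR O_CREAT);

sub TIEHASH {
    my ($class, $basename) = @_;
    my @shards;
    for my $i (0 .. 3) {
        # One DB_File per shard: basename.0 .. basename.3
        tie my %h, 'DB_File', "$basename.$i", O_RDWR|O_CREAT, 0666
            or die "Can't tie $basename.$i: $!";
        push @shards, \%h;
    }
    return bless { shards => \@shards }, $class;
}

# Pick one of the 4 DBMs from the first character of the key.
sub _shard { my ($self, $key) = @_; $self->{shards}[ord($key) & 3] }

sub FETCH  { my ($self, $key) = @_; $self->_shard($key)->{$key} }
sub STORE  { my ($self, $key, $val) = @_; $self->_shard($key)->{$key} = $val }
sub EXISTS { my ($self, $key) = @_; exists $self->_shard($key)->{$key} }
sub DELETE { my ($self, $key) = @_; delete $self->_shard($key)->{$key} }

package main;
tie my %Hash, 'ShardedDBM', 'mydata';
$Hash{somekey} = 'somevalue';   # lands in whichever of mydata.0 .. mydata.3 matches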
Re: millions of records in a Hash
by Speedy (Monk) on Feb 24, 2002 at 17:25 UTC
If I understand what you are attempting, you want to check the values of the entire hash
whenever you start to enter a new key $a and discover that the key $a already exists in %Hash. Right now you
are trying to store the values for each hash key in an array, then run through this long array.
Would it work to create an "inverted" hash in advance, stored under another variable name
like %Hash_values, and save the results in an appropriate file? You don't seem to care which key goes
with a value, but rather want a quick way to see whether the value exists. You could, for example,
use Recipe 5.8 in the Perl Cookbook to initially create %Hash_values, which has as its KEYS the values of %Hash. Then, when doing the
checks, tie BOTH %Hash and %Hash_values.
The code could look like:
tie %Hash, 'DB_File', $path_to_hash, O_RDWR|O_CREAT, 0666 or die "Can'
+t tie $path_to_hash: $!";
tie %Hash_values, 'DB_File', $path_to_hash_values, O_RDWR|O_CREAT, 066
+6 or die "Can't tie $path_to_hash_values: $!";
while (($key, $value) = each (%Hash)) {
$Hash_values{$value} = $key; # Or any value for $key
}
untie %Hash;
untie %Hash_values;
Then rather than doing a long foreach loop through an array, you would simply ask "if (exists $Hash_values{$b}) { -- do stuff -- }".
If you find that $b does not exist in the list of values (which now are the KEYS of %Hash_values), you can
simply add it to %Hash_values to keep the running
list of values (perhaps with $Hash_values{$b} = 1; since you don't care what the "value" of the value is, but just need a quick way to look up the values).
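Putting the check and the insert together, a minimal sketch (assuming %Hash_values is tied as above and $b holds the value being checked):
if (exists $Hash_values{$b}) {
    # do stuff -- the value is already present somewhere in %Hash
}
else {
    $Hash_values{$b} = 1;    # remember it; the stored value doesn't matter
}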
Re: millions of records in a Hash
by jeffenstein (Hermit) on Feb 24, 2002 at 18:18 UTC
You mention trying DB_File to store the hash. Did you use your posted code with DB_File?
If you used your posted code, none of the data would have actually ended up in the database. What you would want to do is use split/join (or Storable, for that matter) to turn your array into a string and store that in the database, something like this:
my $data = $Hash{$a};
my @elt = split /::/, $data;
# Do stuff with @elt here.
# push @elt, $new_data to add data
$data = join '::', @elt;
$Hash{$a} = $data;
This way, your database won't just hold a stringified reference to data in memory; your data will actually reside on disk.
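Or, with Storable, roughly (assuming %Hash is tied to DB_File and $new_data is the item being added):
use Storable qw(freeze thaw);

my @elt = exists $Hash{$a} ? @{ thaw($Hash{$a}) } : ();
push @elt, $new_data;          # add the new item
$Hash{$a} = freeze(\@elt);     # store the whole list as one flat string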
Re: millions of records in a Hash
by Buggs (Acolyte) on Feb 24, 2002 at 18:03 UTC
If you don't need the whole hash in memory at the same time
(and I assume you don't),
then let the database do the work.
Like this pseudo code:
SELECT ... FROM foo WHERE bar = $a;
found a row? UPDATE, else INSERT
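In DBI that pseudo code could look roughly like this (the table foo, its columns bar and baz, the DSN, and the credentials are all placeholders):
use strict;
use DBI;

my $dbh = DBI->connect('dbi:Oracle:mydb', 'user', 'password',
                       { RaiseError => 1, AutoCommit => 1 });

my ($key, $new_value) = ('somekey', 'somevalue');   # whatever pair you are processing

# Is this key already there?
my ($found) = $dbh->selectrow_array(
    'SELECT baz FROM foo WHERE bar = ?', undef, $key);

if (defined $found) {
    $dbh->do('UPDATE foo SET baz = ? WHERE bar = ?', undef, $new_value, $key);
}
else {
    $dbh->do('INSERT INTO foo (bar, baz) VALUES (?, ?)', undef, $key, $new_value);
}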
Re: millions of records in a Hash
by joealba (Hermit) on Feb 25, 2002 at 02:43 UTC
Okay... I'll bite... Why on Earth would you load a hash with 150 million key/value pairs!? Software packages like Oracle and MySQL were developed for one major reason: storing large amounts of data for optimal retrieval requires lots of code tricks.
I'll assume you have a good reason and move on. :) The size of a hash is actually not infinite -- as you've found, a hash becomes unusable after a certain size. I know there's a module out there called something like *::BigHash, but I can't find anything on it. Am I imagining things, or do any other monks know about this module?
Update: Cool! I'm learning about the specifics of DB_File now because of your question. Thanks! Anyway, maybe you need to use a different hashing algorithm? Can you tell whether your program is bound by CPU cycles, disk I/O, etc. when it begins to fail? Have you tried toying around with cache sizes on your DB_Hash? What about different key sizes (if your data can sensibly be stored as a BTREE)? Can you give us a little more info about the data so we can help you out? And what kind of computer are you using that can handle all this data!?!?
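For what it's worth, here is a minimal sketch of the sort of tuning I mean (the file name and cache size are arbitrary examples):
use DB_File;
use Fcntl qw(O_RDWR O_CREAT);

$DB_BTREE->{cachesize} = 100_000_000;    # give Berkeley DB ~100 MB of cache
tie my %Hash, 'DB_File', 'mydata.db', O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Can't tie mydata.db: $!";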
You appear to have a common misunderstanding about the
reason that relational databases exist.
The key win of relational databases is that they allow
people to store, manage, manipulate and flexibly query
data structures without having to think in detail about
algorithms. If I was managing a few hundred records and
needed to do things like easily find what classes Johnny
took, I would be inclined to use a relational database for
those features. And if I had a good one, it would grow
with me if I needed to handle millions of records without
my having to get into some heavy-duty wizardry.
However, the problem of efficient storage of and access to
data is independent of the data-management structure of a
relational database. That is the role of a DBM. The
technologies in DBMs are buried somewhere inside of
traditional relational databases. (MySQL can actually run
on top of Berkeley DB.) But sometimes your structure is
simple enough that you can manage it yourself, and in that
case there is no reason not to remove the relational
layer. A large amount of data is not the same thing as complex
data.
(There are also other management strategies than relational,
for instance object-oriented databases like Gemstone.)
You are surely right that DBMs, like BDB, seem to fit the problem well.
On the other hand, maybe the seeker needs multiuser access or wants to connect to his data remotely.
We don't know that.
After all, RDBMSs provide many more services and usually fit smaller problems too, often also avoiding scaling problems.
So the general advice to use an RDBMS where a DBM would suffice is not always based on a misunderstanding; maybe it is even more often based on the simple fact that people deal more with RDBMSs than with DBMs.
People run multiuser operating systems for mere desktop usage, after all, and many are happy with that, even though a single-user system would suffice.
Another approach that is often forgotten is to write your own storage method, which, given the seeker's description, doesn't seem out of hand and could well result in the most performant solution.
Good point, tilly. The relational aspects of Oracle and MySQL may not help (depending on the data). But I assume that database packages like these use storage that is more optimized than DB_File's, which may work better for such large data sets. Am I wrong?
Thanks, tilly, for your advice. I had been trying to load a simple %hash variable with the key/value pairs, and I have to take care of the duplicates. The key/value pairs exist in an Oracle DB; it's not indexed on the key. Would hitting the DB using the DBI module be more efficient than trying to load up a %hash? I am using a monster DEC Alpha box with at least 3 GB of RAM.
Hi joealba,
I have a 12-byte key and a 15-byte value that I am trying to store in a %hash variable. The key is alphanumeric, the value is numeric. I think I might have hit some kind of memory limit. Hard to imagine, though: I am using a box with a 16 GB virtual memory size and 3 GB of RAM, so I shouldn't have had any issues, but it looks like I do. It's a DEC Alpha box. I've been off on other issues and am now back to tackling this one. Any wisdom will be gladly accepted. Thanks in advance.