Re (tilly) 2: millions of records in a Hash
by tilly (Archbishop) on Feb 25, 2002 at 05:38 UTC
You appear to have a common misunderstanding about the
reason that relational databases exist.
The key win of relational databases is that they allow
people to store, manage, manipulate and flexibly query
data structures without having to think in detail about
algorithms. If I were managing a few hundred records and
needed to do things like easily find what classes Johnny
took, I would be inclined to use a relational database for
those features. And if I had a good one, it would grow
with me if I needed to handle millions of records without
my having to get into some heavy-duty wizardry.
However the problem of efficient storage and access to
data is independent of the data management structure of a
relational database. That is the role of a dbm. The
technologies in dbms are buried somewhere inside of
traditional relational databases. (MySQL can actually run
on top of Berkeley DB.) But sometimes your structure is
simple enough that you can manage it yourself, and in that
case there is no reason not to remove the relational
layer. A large amount of data is not the same as complex
data.
(There are also other management strategies than relational,
for instance object-oriented databases like Gemstone.)
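As a hedged sketch of the simple key/value case described above (the file name and keys are invented for illustration), a Perl hash can be tied directly to an on-disk dbm. Here the core AnyDBM_File module is used, which picks whichever dbm library (Berkeley DB, GDBM, SDBM, ...) your Perl was built with:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;         # for the O_RDWR and O_CREAT flags
use AnyDBM_File;   # ties to whichever dbm library is available

# Tie a plain hash to an on-disk dbm file; the data lives on disk,
# so lookups stay cheap even with millions of records.
tie my %classes, 'AnyDBM_File', 'classes_db', O_RDWR|O_CREAT, 0644
    or die "Cannot tie classes_db: $!";

$classes{'johnny'} = 'math101,history202';
my $johnny = $classes{'johnny'};   # ordinary hash lookup, served from disk

untie %classes;
print "$johnny\n";
```

Once tied, the hash is used exactly like an in-memory one; only the storage changes.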
You are right that dbms, like BDB, seem to fit the problem well.
On the other hand, maybe the seeker needs multiuser access or wants to connect to his data remotely.
We don't know that.
After all, RDBMSs provide many more services and usually fit smaller problems too, often avoiding scaling problems as well.
So the general advice to use an RDBMS where a dbm would suffice is not always based on a misunderstanding; perhaps it is even more often based on the simple fact that people deal more with RDBMSs than with dbms.
People run multiuser operating systems for mere desktop usage, after all, and many are happy with that, even though a single-user system would suffice.
Another approach that is often forgotten is to write your own storage method, which, given the seeker's description, doesn't seem out of reach and could well yield the most performant solution.
This is all true. But I am still cautious about telling
people to use an RDBMS when they either don't have the
background to understand one (and I don't have the energy
to teach that), or they might understand both and have a
good reason for using a dbm.
As for writing your own storage method, I would strongly
discourage people from doing that unless they already know,
for instance, the internals of how a dbm works. And if
someone comes back and asks me for that, my response will
be that if they have to ask, the odds are that I can't
teach them enough about the subject to do any better than
they can do just by using the already existing wheel. And
this is definitely true if they think they can build
their wheel in Perl.
Good point, tilly. The relational aspects of Oracle and MySQL may not help (depending on the data). But I assume that database packages like these use more efficient storage than DB_File, which may work better for such large data sets. Am I wrong?
With significant caveats, yes, you are wrong.
The key to handling large data sets is to have efficient
data structures and algorithms. A dbm does this. Given
that there aren't significantly better ones available, a
relational database cannot improve much. Oh,
sometimes it might be possible to get a slight win from a
known fixed structure. Mostly if you do, you lose it
several times over from having the relational database.
(Particularly if, as is usually the case, you have the
database code in another process that your process needs
to communicate with.)
However, change the picture by saying that you don't
have a key/value relationship, but rather have a tabular
structure where you want to be able to do quick lookups on
either of two different fields. Stuff that into any
decent relational database and add an index on those two
fields. Done. What do you have to do to get that with
dbms? Well, you could store your data in a long linear
file, and then store offsets in a key/value relationship
in a couple of dbms (one for each index). This is a lot
of custom code to duplicate what the relational database is
doing already. And should the spec change just
slightly (say you need a third field), you have a lot of
recoding (and debugging) to do.
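A hedged sketch of the hand-rolled scheme just described, for the curious: records appended to a flat file, with one dbm per indexed field mapping values to byte offsets. All file names, field names, and sample records here are invented:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;
use AnyDBM_File;

# Flat data file holding the records themselves.
open my $data, '+>', 'records.dat' or die "records.dat: $!";

# One dbm index per field we want fast lookups on.
tie my %by_name, 'AnyDBM_File', 'idx_name', O_RDWR|O_CREAT, 0644 or die $!;
tie my %by_id,   'AnyDBM_File', 'idx_id',   O_RDWR|O_CREAT, 0644 or die $!;

sub add_record {
    my ($name, $id, $rest) = @_;
    seek $data, 0, 2;               # append at end of file
    my $offset = tell $data;        # remember where this record starts
    print {$data} "$name\t$id\t$rest\n";
    $by_name{$name} = $offset;      # maintain both indexes by hand
    $by_id{$id}     = $offset;
}

sub fetch_at {
    my ($offset) = @_;
    seek $data, $offset, 0;         # jump straight to the record
    my $line = <$data>;
    chomp $line;
    return $line;
}

add_record('johnny', 42, 'math101');
add_record('sally',  77, 'cs201');

my $rec = fetch_at($by_id{77});     # lookup by either indexed field
```

Note how every insert has to touch every index by hand; that bookkeeping is exactly what the relational database's `CREATE INDEX` does for you.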
Sounds like if you need even the simplest of basic
structures, relational databases give a nice development win
over a dbm. And should you need 2 tables and a join, well,
your average programmer will probably implement it by
searching each table once for each element in the other.
(Either explicitly or implicitly.) Avoiding that mistake
by default makes your algorithms scale much better. Need I
say more about how quickly the relational database pays
for itself?
But if your problem is dead simple, then the relational
database is a loss.
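For instance, the two-table join case mentioned above, in SQL (the schema here is invented): the RDBMS plans this with indexes and join algorithms instead of the naive scan-per-element approach.

```sql
-- Find every class Johnny took, without hand-coding the lookup loop
SELECT e.class_name
FROM students s
JOIN enrollments e ON e.student_id = s.id
WHERE s.name = 'Johnny';
```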
Thanks, tilly, for your advice. I had been trying to load a simple %hash variable with the key/value pairs, and I have to take care of the duplicates. The key/value pairs exist in an Oracle db; it's not indexed on the key. Would hitting the db using the DBI module be more efficient than trying to load up a %hash? I am using a monster DEC Alpha box with at least 3 GB of RAM.
I haven't read the whole exchange (sorry!), but if your reason for doing this is to weed out duplicates then I suggest you do that directly in SQL.
Something like
select distinct key, value into unique_table
from duplicate_table
should work, and shouldn't tax your "monster dec alpha box" excessively.
If you want to find which rows have duplicate keys you may have to add a COUNT(*) and a GROUP BY clause...
Michael
Update Note that the select statement above will only work if you have "create table" privileges in the database...
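The COUNT(*)/GROUP BY variant hinted at above might look something like this, reusing the column names from the earlier query (the HAVING clause is an assumption about what "duplicate" means here):

```sql
-- List keys that appear more than once, with how many copies each has
SELECT key, COUNT(*) AS copies
FROM duplicate_table
GROUP BY key
HAVING COUNT(*) > 1;
```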
Re: Re: millions of records in a Hash
by johnkj (Initiate) on Mar 06, 2002 at 21:13 UTC
hi joelba,
I have a 12-byte key and a 15-byte value that I am trying to store in a %hash variable. The key is alphanumeric, the value is numeric. I think I might have hit some kind of memory limit. Hard to imagine, though: I am using a box with a 16 GB virtual memory size and 3 GB of RAM, so I shouldn't have had any issues, but it looks like I do. It's a DEC Alpha box. I've been busy with other issues and am now back to tackling this one. Any wisdom will be gladly accepted. Thanks in advance.