Re: Re (tilly) 2: millions of records in a Hash
by Buggs (Acolyte) on Feb 25, 2002 at 06:20 UTC
You are surely right that DBMs, like BDB, seem to fit the problem well.
On the other hand, maybe the seeker needs multiuser access or wants to connect to his data remotely. We don't know that.
After all, RDBMSs provide many more services and usually fit smaller problems too, often also avoiding scaling problems.
So the general advice to use an RDBMS where a DBM would suffice is not always based on a misunderstanding; maybe it is even more often based on the simple fact that people deal more with RDBMSs than with DBMs.
People run multiuser operating systems for mere desktop usage, after all, and many are happy with that, even though a single-user system would suffice.
Another approach that is often forgotten is to write your own storage method, which, given the seeker's description, doesn't seem out of reach and could well be the most performant solution.
This is all true. But I am still cautious about telling people to use an RDBMS when they either don't have the background to understand one (and I don't have the energy to teach that), or they might understand both and have a good reason for using a dbm.

As for writing your own storage method, I would strongly discourage people from doing that unless they already know, for instance, the internals of how a dbm works. And if someone comes back and asks me for that, my response will be that if they have to ask, the odds are that I can't teach them enough about the subject to do any better than they can do just by using the already existing wheel. And this is definitely true if they think they can build their wheel in Perl.
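For what it's worth, "using the already existing wheel" can be as small as this sketch (the file name and data are invented for illustration):

use strict;
use DB_File;
use Fcntl;

# Tie a hash to an on-disk B-tree; the records live in
# the file, not in memory, so millions of keys are fine.
tie my %hash, 'DB_File', 'records.db', O_RDWR|O_CREAT, 0644, $DB_BTREE
    or die "Cannot open records.db: $!";

$hash{'some key'} = 'some value';    # stored on disk
print $hash{'some key'}, "\n";       # fetched from disk

untie %hash;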
tilly:
"As for writing your own storage method, I would strongly discourage people from doing that unless they already know, for instance, the internals of how a dbm works. And if someone comes back and asks me for that, my response will be that if they have to ask, the odds are that I can't teach them enough about the subject to do any better than they can do just by using the already existing wheel.And this is definitely true if they think they can build their wheel in Perl."
Given precise knowledge of the data's signature (the seeker mentioned 12-byte keys, for example), one can build very fine wheels using optimized algorithms, with Perl or without.
Naturally a starting point would be to look at a DBM implementation, but I wonder why, in a Seekers of Perl Wisdom section, one would recommend against going the hard way and learning a lot of stuff.
And if you can't teach him/her, there might be others who can, or the seeker might just go his own way and find things out for himself.
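As a sketch of the kind of wheel I mean, assume fixed-width records (the 12-byte keys the seeker mentioned, plus a made-up 20-byte value) kept sorted in a plain file; a lookup is then a binary search over seek and read:

use strict;

# Hypothetical layout: records sorted by key, each one
# exactly 12 bytes of key followed by 20 bytes of value.
my ($KEYLEN, $VALLEN) = (12, 20);
my $RECLEN = $KEYLEN + $VALLEN;

sub lookup {
    my ($fh, $key) = @_;
    my ($lo, $hi) = (0, (-s $fh) / $RECLEN);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        seek $fh, $mid * $RECLEN, 0 or die "seek: $!";
        read($fh, my $rec, $RECLEN) == $RECLEN or die "read: $!";
        my $cmp = substr($rec, 0, $KEYLEN) cmp $key;
        return substr($rec, $KEYLEN) if $cmp == 0;
        $cmp < 0 ? ($lo = $mid + 1) : ($hi = $mid);
    }
    return undef;    # key not present
}

open my $fh, '<', 'records.dat' or die "records.dat: $!";
binmode $fh;
# Keys in the file must be padded to 12 bytes the same way.
my $value = lookup($fh, sprintf '%-12s', 'mykey');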
Re: Re (tilly) 2: millions of records in a Hash
by joealba (Hermit) on Feb 25, 2002 at 19:01 UTC
Good point, tilly. The relational aspects of Oracle and MySQL may not help (depending on the data). But I assume that database packages like these use more optimal storage than DB_File, which may work better for such large data sets. Am I wrong?
With significant caveats, yes, you are wrong. The key to handling large data sets is to have efficient data structures and algorithms. A dbm does this, and given that there aren't significantly better ones available, a relational database cannot improve on it much. Oh, sometimes it might be possible to get a slight win from a known fixed structure; mostly, if you do, you lose it several times over to the overhead of the relational database. (Particularly if, as is usually the case, the database code lives in another process that your process needs to communicate with.)
However, change the picture: say you don't have a key/value relationship, but rather a tabular structure where you want quick lookups on either of two different fields. Stuff that into any decent relational database and add an index on each of those two fields. Done. What do you have to do to get that with dbms? Well, you could store your data in a long linear file, and then store offsets in a key/value relationship in a couple of dbms (one for each index). That is a lot of custom code to duplicate what the relational database already does. And should the spec change just slightly (say you need a third indexed field), you have a lot of recoding (and debugging) to do.
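To make that concrete, here is a sketch of the hand-rolled scheme just described (file and field names invented); note how much bookkeeping it takes to match two CREATE INDEX statements:

use strict;
use DB_File;
use Fcntl;

# Linear data file: one record per line. Two dbm indexes
# map each field's value to the record's byte offset.
open my $data, '<', 'table.dat' or die "table.dat: $!";
tie my %by_name, 'DB_File', 'name.idx', O_RDONLY, 0644, $DB_HASH
    or die "name.idx: $!";
tie my %by_date, 'DB_File', 'date.idx', O_RDONLY, 0644, $DB_HASH
    or die "date.idx: $!";

sub fetch_by_name {
    my $name = shift;
    defined(my $offset = $by_name{$name}) or return undef;
    seek $data, $offset, 0 or die "seek: $!";
    return scalar <$data>;    # the full record
}

# A third indexed field means a third index file, code to
# keep it in sync on every write, and a rebuild of all the
# data already on disk.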
Sounds like, once you need even the simplest structure beyond key/value, relational databases give a nice development win over a dbm. And should you need two tables and a join, well, your average programmer will probably program it by searching each table once for each element in the other (either explicitly or implicitly). Avoiding that mistake by default makes your algorithms scale much better. Need I say more about how quickly the relational database pays for itself?
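For illustration, handing the join to the database through DBI might look like this (the DSN, tables, and columns are invented):

use strict;
use DBI;

my $dbh = DBI->connect('dbi:Oracle:mydb', 'user', 'pass',
                       { RaiseError => 1 });

# One statement; the database picks the join strategy
# instead of us scanning one table once per row of the other.
my $rows = $dbh->selectall_arrayref(q{
    SELECT a.id, a.name, b.total
    FROM   customers a, orders b
    WHERE  a.id = b.customer_id
});

print join("\t", @$_), "\n" for @$rows;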
But if your problem is dead simple, then the relational
database is a loss.
Re: Re (tilly) 2: millions of records in a Hash
by johnkj (Initiate) on Mar 06, 2002 at 21:08 UTC
Thanks, tilly, for your advice. I had been trying to load a simple %hash variable with the key/value pairs, and I have to take care of the duplicates. The key/value pairs exist in an Oracle db; it's not indexed on the key. Would hitting the db using the DBI module be more efficient than trying to load up a %hash? I am using a monster DEC Alpha box with at least 3 GB of RAM.
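The hash-loading approach I mean would look something like this (connection details and column names invented):

use strict;
use DBI;

my $dbh = DBI->connect('dbi:Oracle:mydb', 'user', 'pass',
                       { RaiseError => 1 });
my $sth = $dbh->prepare('SELECT key_col, val_col FROM pairs');
$sth->execute;

my %hash;
while (my ($key, $value) = $sth->fetchrow_array) {
    next if exists $hash{$key};    # keep the first of any duplicates
    $hash{$key} = $value;
}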
I haven't read the whole exchange (sorry!), but if your reason for doing this is to weed out duplicates, then I suggest you do that directly in SQL.
Something like
create table unique_table as
select distinct key, value
from duplicate_table
should work, and shouldn't tax your "monster DEC Alpha box" excessively.
If you want to find which rows have duplicate keys, you may have to add a COUNT(*) and a GROUP BY clause...
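For instance (using the same illustrative column names):

select key, count(*)
from duplicate_table
group by key
having count(*) > 1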
Michael
Update: Note that the statement above will only work if you have "create table" privileges in the database...