Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a tremendous amount of data that I need to have accessible at my program's fingertips at any time while it is running. An external database is out of the question. Speed is of utmost importance. I tried dumping the data to a file and parsing the file to redefine the variables at the next run, but this proved far too inefficient for my needs. For much the same reason, Data::Dumper will not do the job.

With speed being one of the most important factors, what do you suggest I do to handle this? Currently, the majority of the data is stored in hashes. My goals are to A) not use up all my RAM, and B) still remain fast. I understand that this is something of a "pick one or the other" situation, but I am looking for a good in-between way to handle this.

Thanks for any help you could offer.

Re: Data. Lots of Data.
by AgentM (Curate) on Feb 07, 2001 at 04:45 UTC
    You say:

    An external database is out of the question.

    and

    Speed is of utmost importance.

    You make them sound mutually exclusive when, in fact, you have an oxymoron. If you want real speed, then you want a real database. Perhaps if you explain why it is "out of the question", we can give you more appropriate guidance. Using AnyDBM_File may meet your criteria, but no one claims that it screams speed. Why not use the database solution?
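
    For reference, a minimal sketch of the tied-hash approach AnyDBM_File offers; the file name mydata is made up for illustration, and AnyDBM_File simply delegates to whichever DBM implementation your Perl was built with:

        use strict;
        use warnings;
        use Fcntl;        # for O_RDWR and O_CREAT
        use AnyDBM_File;  # ships with Perl; picks an available DBM backend

        # Tie a hash to an on-disk DBM file; lookups go to disk, not RAM.
        tie my %data, 'AnyDBM_File', 'mydata', O_RDWR|O_CREAT, 0644
            or die "Cannot tie mydata: $!";

        $data{answer} = 42;           # written through to disk
        print $data{answer}, "\n";    # read back on demand

        untie %data;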

    AgentM Systems nor Nasca Enterprises nor Bone::Easy nor Macperl is responsible for the comments made by AgentM. Remember, you can build any logical system with NOR.
Re: Data. Lots of Data.
by Hot Pastrami (Monk) on Feb 07, 2001 at 04:38 UTC
    When you say "An external database is out of the question", does this mean that Berkely DB (and similar) cannot be used? They're pretty darn fast (compared to flat text files) and are easy to get your hands on, either included with Perl or on CPAN.

    Update: Sorry, that's supposed to be spelled Berkeley (the wrong spelling won't help much if you try to do a search on it).

    Hot Pastrami
      By external I meant MySQL or any non-Perl storage solution. Preferably, I would like to use modules that are packaged with the Perl distro.

      Thanks

        Depending on your distribution, Berkeley DB may be included. Even if it isn't included, it's unspeakably simple to install, well worth the small effort. It's just as easy to install as a home-grown module, but better, faster, and time-tested.
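
        For what it's worth, here is a minimal sketch of tying a hash to a Berkeley DB file with DB_File; the file name db.file and the key are made up for illustration:

            use strict;
            use warnings;
            use Fcntl;
            use DB_File;

            # Tie a hash to a Berkeley DB hash file on disk.
            tie my %h, 'DB_File', 'db.file', O_RDWR|O_CREAT, 0644, $DB_HASH
                or die "Cannot open db.file: $!";

            $h{key} = 'value';      # stored on disk, not in RAM
            print "$h{key}\n";      # fetched from disk on demand

            untie %h;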

        Hot Pastrami
Re: Data. Lots of Data.
by Maclir (Curate) on Feb 07, 2001 at 04:50 UTC
    You stated three requirements:
    1. I have got a tremendous amount of data that I need to have accessible at my program's finger tips at any time while it is being used.
    2. An external database is out of the question.
    3. Speed is of utmost importance.
    Unfortunately, by ruling out a database, you will not be able to achieve (1) and (3). Managing large volumes of data quickly and efficiently is what databases are for. You may not need to go to a top-shelf product like Oracle, DB/2 or Informix, but there are several open source database systems that integrate well with Perl. MySQL and PostgreSQL are both used extensively.
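
    As a rough illustration, a minimal DBI sketch; the database, table, and column names here are all hypothetical:

        use strict;
        use warnings;
        use DBI;

        # Connect to a local MySQL database (all names are made up).
        my $dbh = DBI->connect('dbi:mysql:database=mydata', 'user', 'password',
                               { RaiseError => 1 });

        # Let the database do the lookup instead of a giant in-memory hash.
        my $sth = $dbh->prepare('SELECT value FROM items WHERE name = ?');
        $sth->execute('some_key');
        while (my ($value) = $sth->fetchrow_array) {
            print "$value\n";
        }
        $dbh->disconnect;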

    Alternatively, you could try to "roll your own" - but there is no way I would even think about attempting that myself.

    Then again, it depends on just how much you mean by "lots".

Re: Data. Lots of Data.
by eg (Friar) on Feb 07, 2001 at 04:58 UTC

    Buy more RAM.

    Seriously, you should reconsider your prohibition against using an external database. At the very least, carefully consider following Hot Pastrami's advice and using Berkeley DB. It's very much worth the effort.

    Depending on the nature of your lookups, it might be worth your while to memoize your data-accessing functions.
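
    A minimal sketch using the Memoize module; the lookup function here is hypothetical:

        use strict;
        use warnings;
        use Memoize;

        # A stand-in for an expensive data-access function.
        sub lookup {
            my ($key) = @_;
            # ... expensive fetch from disk would go here ...
            return "value for $key";
        }
        memoize('lookup');   # cache results keyed on the arguments

        print lookup('foo'), "\n";  # computed the first time
        print lookup('foo'), "\n";  # served from Memoize's cache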

Re: Data. Lots of Data.
by MeowChow (Vicar) on Feb 07, 2001 at 09:57 UTC
    How about a hardware solution?

    A low memory footprint, no database access, and fast performance sound like a job for an SSD (solid-state disk drive).

    Of course, you may not have $5000 burning a hole in your pocket, in which case, I'd go with the dbm file :)

       MeowChow                                               
                    print $/='"',(`$^X\144oc $^X\146aq1`)[-2]
Re: Data. Lots of Data.
by zeno (Friar) on Feb 07, 2001 at 18:42 UTC

    You mentioned that you store most of the data in hashes. Why not load data into the hash only when it is asked for, using the hash as a cache? Then you get the speed of a hash without loading all of the data up front, when you may only need part of it.

    An example cribbed more or less from page 609 of Programming Perl, 3rd edition:

    #!/usr/bin/perl -w
    use strict;
    use warnings;

    sub get_data {
        # here we simulate the expensive operation
        # of loading data into the hash
        my $idx = shift;
        print "retrieving data for $idx...\n";
        return "value for $idx";
    }

    my $result;
    my %cache;
    foreach (1, 2, 4, 16, 32, 2, 64, 1, 2) {
        # here we are asking for values from the cache
        # (hash), and if they aren't there, we get them
        # from the file.
        $result = $cache{$_} ||= get_data($_);
        print "$result\n";
    }

    The output from this program is:

    retrieving data for 1...
    value for 1
    retrieving data for 2...
    value for 2
    retrieving data for 4...
    value for 4
    retrieving data for 16...
    value for 16
    retrieving data for 32...
    value for 32
    value for 2
    retrieving data for 64...
    value for 64
    value for 1
    value for 2
    As you can see, after the first time we ask for an element, we no longer have to go back to the file to get the data. Neat, eh? I hope that helps you. -zeno
Re: Data. Lots of Data.
by sierrathedog04 (Hermit) on Feb 07, 2001 at 17:49 UTC
    Could this be a job for XQL?

    If you convert your data to XML, then you could use one of the XQL modules at CPAN.

    XQL allows you to use SQL-like syntax to access an XML tree of data, and it could be what you want.
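
    For illustration, a minimal sketch based on the XML::XQL synopsis; the file name and the query are made up:

        use strict;
        use warnings;
        use XML::XQL;
        use XML::XQL::DOM;

        # Parse a document, then query it with XQL syntax.
        my $parser = XML::DOM::Parser->new;
        my $doc    = $parser->parsefile('data.xml');  # hypothetical file

        # Return all <title> elements under the root <book> element.
        my @result = $doc->xql('book/title');
        print scalar(@result), " matching node(s)\n";

        $doc->dispose;   # XML::DOM trees need explicit disposal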

    A potential problem is that some implementations of XQL hold the entire XML tree in memory, others do not. An implementation which did the former might use up all of your memory.

    The documentation for Enno Derksen's XML::XQL on CPAN says that it uses the module XML::DOM, which means that it holds the entire XML tree in memory. The documentation says that future releases might use the module XML::Grove instead; since avoiding memory-hogging XML applications is the very purpose of XML::Grove, using it would avoid holding the entire tree in memory.

    There is also XML::miniXQL by Matt Sergeant on CPAN. It is not clear from its documentation what its memory requirements are.

    If I were going to do this project I would look into XML/XQL. The modules install easily and one can do simple applications in an afternoon.

      XML may be many things, but high performance and memory efficiency are not among its strengths. Quite the opposite, in fact.

      The format is portable, flexible, extensible...but this comes with considerable overhead.

      XML::XQL might be your hammer but we are not dealing with nails here!

      XML::XQL is based on XML::DOM, as you said, which means that it eats up RAM like there's no tomorrow: about 10 times the original size of the XML data (which is already much bulkier than CSV, for example). XML::Grove does not seem to be very well supported, judging from the comments on its review and from the fact that the last release is dated September 9th, 1999; it is slow as a dog in any case. XML::miniXQL is not exactly supported either; its last version was released June 16th, 1999.

      While you might like XML::XQL it is certainly not the ultimate in terms of data munging.

      Sorry for the plug but you might want to have a look at Processing XML with Perl or Ways to Rome to see a couple of comparisons of various XML modules along with benchmarks.

      In this case a database solution seems like the only way to go (and I am as much of an XML evangelist as you might find around here ;--)