mhearse has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to improve the speed of a large Perl application. Currently, every time it runs it generates around five thousand entries in arrays and hashes. I realized that 3/4 of these entries are static (they will probably never change). I'm wondering if it would improve performance to put the static tables into a flat file, or maybe even a DBM file. Unfortunately, a database isn't an option on these machines. I'm new to this, so I thought I'd ask before expending the effort. Thanks.

Re: Table Generation vs. Flat File vs. DBM
by Corion (Patriarch) on May 05, 2004 at 06:15 UTC

    Optimizing here largely depends on whether you need to access all of these values during a run of your program. If you don't, it may save you some time to move the hash data into an external indexed file, since Perl then won't have to build the hash from a list every time the process starts. If you need to access all the data all the time, you could try Storable to make Perl load the data structures faster than it can build them from source code.
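
    As a minimal sketch of the Storable approach (the file name and the table contents below are placeholders for your own data):

        use strict;
        use warnings;
        use Storable qw(store retrieve);

        # Build once (or whenever the static data actually changes).
        my %static_table = map { ("key$_" => $_ ** 2) } 1 .. 5000;
        store( \%static_table, 'static_tables.sto' ) or die "store failed: $!";

        # On every normal run: thaw the structure instead of regenerating it.
        my $table = retrieve('static_tables.sto');
        print "key42 => $table->{key42}\n";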

    A real database would be ideal, as it could cache the data in RAM between runs of your program, but you already said that was out of the question. The next best thing could be to start your program through PPerl, so the bulk of it stays resident, or to write a custom "data daemon" that holds just this data in RAM and serves queries against it.

    All these solutions sound interesting, but first of all, you need to benchmark, benchmark, Benchmark. Keep track of your changes together with the benchmarks, for example in an Excel sheet, so you can track your progress and, much more importantly, figure out whether the increased risk to maintainability and program operation is worth the increase in throughput. If you have one long-running process, the initialization cost is most likely amortized over time anyway, and changes to your processing algorithm will yield much better results than changes to process startup.
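
    A rough sketch of such a benchmark with the Benchmark module (build_tables() stands in for your existing generation code, and the .sto file is the one from the Storable sketch above):

        use strict;
        use warnings;
        use Benchmark qw(cmpthese);
        use Storable qw(retrieve);

        # Stand-in for the existing table generation code.
        sub build_tables {
            return { map { ("key$_" => $_ ** 2) } 1 .. 5000 };
        }

        cmpthese( -5, {    # run each variant for at least 5 CPU seconds
            generate => sub { my $t = build_tables() },
            retrieve => sub { my $t = retrieve('static_tables.sto') },
        } );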

      Right now, I would say the program is accessing all the tables all the time. I guess coding it to look up only what it needs would obviously be faster. I'm going to investigate your external indexed file suggestion.
Re: Table Generation vs. Flat File vs. DBM
by BrowserUk (Patriarch) on May 05, 2004 at 08:34 UTC

    You might take a look at Storable.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
Re: Table Generation vs. Flat File vs. DBM
by pizza_milkshake (Monk) on May 05, 2004 at 06:10 UTC
    How is the data currently being generated? Obviously, the speed will only improve by the difference between however it's being generated now and reading it off the disk.

    In general, using DBM is easy. I just played with it the other day and got it working within an hour of being told I needed to work with it (having never used it in Perl before).
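
    For example, a minimal sketch tying a hash to SDBM_File (one of the DBM flavors that ships with Perl); the file name and key here are made up:

        use strict;
        use warnings;
        use Fcntl;
        use SDBM_File;

        # Tie a hash to an on-disk DBM file, creating it if needed.
        tie my %db, 'SDBM_File', 'static_db', O_RDWR | O_CREAT, 0666
            or die "tie failed: $!";

        $db{answer} = 42;          # written straight to disk
        print "$db{answer}\n";     # read back through the same hash
        untie %db;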

    perl -e"\$_=qq/nwdd\x7F^n\x7Flm{{llql0}qs\x14/;s/./chr(ord$&^30)/ge;print"

      The large tables are generated with map and grep loops, basically making many combinations of a few things. I guess I will try dumping the static values into a flat file and a DBM file, then benchmark it both ways. I guess my question now is: "Is opening and reading a flat file faster than opening and reading a DBM file? Are there any performance benefits with one or the other?"
        YMMV, depending on which flavor of DBM you pick and how your hardware gets along with it. As a general rule, writing to any sort of DBM file tends to be somewhat more expensive than writing a flat file, in terms of overall space consumed, amount of actual disk i/o performed, and total cpu time required.

        But when reading data back after you've stored it, a DBM file is vastly better, especially when fetching values in a quasi-random fashion from a very large set -- or at least, whenever the fetching order is very different from the storage order. In such cases, doing repeated sequential searches over a flat file will kill you, whereas the DBM file is really just a big hash array on disk, optimized to deliver any chosen piece of data in a consistently short amount of time.

        So the question really is "what sort of access do you really need when reading the data back?" If you can easily write a flat file such that you just need to read it back once from beginning to end, then a flat file will be the better choice.
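
        To make the two access patterns concrete, here is a small sketch; the file names, the tab-separated layout, and 'some_key' are placeholders:

            use strict;
            use warnings;
            use Fcntl;
            use SDBM_File;

            # Flat file: one sequential pass, fine if you only ever
            # read it front to back.
            open my $fh, '<', 'table.txt' or die "open failed: $!";
            while ( my $line = <$fh> ) {
                chomp $line;
                my ( $key, $value ) = split /\t/, $line, 2;
                # ... process each record in storage order ...
            }
            close $fh;

            # DBM file: a tied hash fetches any key directly,
            # without scanning the rest of the data.
            tie my %db, 'SDBM_File', 'table_db', O_RDONLY, 0666
                or die "tie failed: $!";
            my $value = $db{some_key};
            untie %db;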