TBH I could not wrap my head around how to dynamically write to DBs like Storable and DBM::Deep (I am assuming this is what you mean by a DB?). The hash structure in the original program is a HoA that grows like this:
$ProbState{$State}[$IndexLine] = $linelist[$i+1];
Yup, 16GB is underpowered for a 3-year-old laptop, but I am stuck with it. Would you please kindly direct me to an example or some working code that demonstrates how I can use a DB?
UPDATE: I had some luck with DBM::Deep, where I can store and retrieve from the database; however, the write step seems to be very slow, although it keeps the memory footprint acceptable.
Something or the other, a monk since 2009
I was thinking of an SQL DB. My recommendation would be to start with the simplest thing, an SQLite DB. It doesn't have fancy multi-threading to process SQL commands, but it does use standard SQL, and the code that you write for it can be reused with more capable DBs like Postgres, etc. If SQL (pronounced "See-quel") is a foreign word to you, all the more reason to start with something simple. The installation, configuration and management of a DB server can get complicated; SQLite doesn't require any of that. In addition, it is possible to dynamically vary the memory footprint available to SQLite. I've used that feature before and it works. SQLite will use the amount of memory that it has - it might get a lot slower with less memory, but it won't "blow up".
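For what it's worth, here is a minimal sketch of what that could look like from Perl, using DBI with DBD::SQLite. The file name, table name and columns are made up to mirror the %ProbState HoA above, not taken from your actual code:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Connect to (or create) a single-file SQLite DB. No server needed.
my $dbh = DBI->connect("dbi:SQLite:dbname=prob_states.db", "", "",
                       { RaiseError => 1, AutoCommit => 1 });

# One row per (state, index) pair instead of a growing in-memory HoA.
$dbh->do("CREATE TABLE IF NOT EXISTS prob_state (
              state      TEXT    NOT NULL,
              index_line INTEGER NOT NULL,
              value      TEXT
          )");

# Roughly the DB equivalent of $ProbState{$State}[$IndexLine] = $linelist[$i+1];
my $ins = $dbh->prepare(
    "INSERT INTO prob_state (state, index_line, value) VALUES (?, ?, ?)");
$ins->execute('S1', 0, '0.42');    # placeholder data

# Pull one state's values back out, in index order.
my $rows = $dbh->selectall_arrayref(
    "SELECT index_line, value FROM prob_state WHERE state = ? ORDER BY index_line",
    undef, 'S1');
print "$_->[0] => $_->[1]\n" for @$rows;

$dbh->disconnect;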
I have no idea what your raw data looks like or what end report you are trying to generate. Showing failed Perl code is not enough. You will have to back up and explain the initial problem (show some abbreviated example data) and some example of an end result. Then perhaps I or other Monks can suggest an appropriate table structure.
Don't underestimate SQLite. One project that I'm working on now has 3 directories with 3,000+ files in each. Creating the table for each directory takes about 8 seconds for 1M rows. Processing each directory's worth of data takes <2 seconds. I am telling you that a 3M-row DB is nothing. How many million lines do you have, and how much data is in each line? It could very well be that instead of a single complete "de-normalized" table, you will wind up with multiple tables that are related somehow. For one project, I wound up using the URL that the data came from as the link between 2 tables. It didn't have to be that way, but that was sufficient for that project. DBs often use integer values as "keys" that link tables, but it doesn't have to be that way.
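To make that concrete, here is a hypothetical sketch of two tables linked by a URL column. All the names are invented, and $dbh is a connected DBI/SQLite handle as in the example above:

# Table of sources, keyed by URL.
$dbh->do("CREATE TABLE IF NOT EXISTS pages (
              url     TEXT PRIMARY KEY,
              fetched TEXT
          )");

# Table of per-line data, pointing back at its source via the same URL.
$dbh->do("CREATE TABLE IF NOT EXISTS observations (
              url   TEXT NOT NULL REFERENCES pages(url),
              line  INTEGER,
              value TEXT
          )");

# Join the two tables on the shared URL "key".
my $sth = $dbh->prepare(
    "SELECT p.url, o.line, o.value
       FROM pages p JOIN observations o ON o.url = p.url
      WHERE p.url = ?");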
I don't know enough to advise further.
UPDATE: re: "UPDATE: I had some luck with DBM::Deep, where I can store and retrieve from the database; however, the write step seems to be very slow, although it keeps the memory footprint acceptable."
The "write step" to a DB is fast. The "commit step" is slow and DB journaling does take time. I have never used DBM::Deep. With SQLite as with other SQL db's, you want to: $db->begin_work;, then do millions of inserts, then do $dbh->commit; A single transaction can have millions of rows. I do recommend switching to SQLite.
SQLite is written in C. DBM::Deep is written in pure Perl. SQLite is by far the most used DB in the world - it's in your cell phone, it's in your browser, it's everywhere. There are many performance tweaks that can be done to SQLite - often at the expense of compromising ACID properties. However, sometimes this is completely appropriate. In my example referenced above, I could speed things up by stopping journaling, starting async writes, etc. In my case 30 sec is "fast enough" and I don't worry about DB creation time; 3 vs 30 seconds is the same for me. The "big lifting" is making a million inserts one transaction. But I could make my app faster if I really wanted or needed to (which I don't).
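Those tweaks are just PRAGMAs issued right after connecting. A sketch, assuming you can simply re-run the load if the machine dies mid-way (otherwise leave the defaults alone):

$dbh->do("PRAGMA journal_mode = OFF");  # no rollback journal during the bulk load
$dbh->do("PRAGMA synchronous  = OFF");  # don't fsync after every write
# After the load, you can put the safer defaults back:
# $dbh->do("PRAGMA journal_mode = DELETE");
# $dbh->do("PRAGMA synchronous  = FULL");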
Recommendation: Switch to SQLite. Faster, more options, gigantic user base, very stable. The DBD::SQLite driver bundles the SQLite library itself and is used through the standard DBI module, so there is a good chance it is already on your machine.
Another Update: Re: memory usage - DBM::Deep is a Perl program. SQLite is a C program, and it can and does play by different memory management rules. I am not sure if this limit still exists, but at one time SQLite was limited to 2GB of RAM; its default is way, way less than that. In one application that I targeted at WinXP, I ran SQLite's memory up to the extravagant level of 500MB for one "expensive" indexing operation and then back down again after that. A Perl program cannot give memory back to the O/S, but SQLite can. None of this matters for DB creation, but if you needed, say, to make a histogram of a column with 50M rows, more memory probably would help. I would have to look back at previous code to find the commands for adjusting memory, but they do exist. My XP application was targeted at a machine which had a max of only 1-2GB of RAM. For your app, I would run the memory footprint up to 2GB and not worry about it after that.
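A sketch, assuming the knob in question is SQLite's cache_size PRAGMA (a negative value means that many KiB of page cache); the index statement is only an example of an "expensive" step:

$dbh->do("PRAGMA cache_size = -2000000");                 # ~2 GB of page cache for the big step
$dbh->do("CREATE INDEX idx_state ON prob_state(state)");  # example "expensive" operation
$dbh->do("PRAGMA cache_size = -2000");                    # back down to ~2 MB afterwards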
I appreciate the comprehensive reply, Marshall. I elaborated on the data and the actual code in one of the responses to this thread.
The current project that is blowing up the RAM is the smallest benchmark before scaling to the wider dataset, and I can foresee issues with reading and retrieving despite MCE coming to the rescue (for now). Hence I see the case for SQL/SQLite (which has remained outside my toolbox for ages because of the bittersweet relationship I had with Oracle DBs).
The next step is to statistically analyse the final data in R. I am kinda apprehensively curious about how R will take to reading the file.
In fact, the DBM::Deep::Cookbook documents the slow performance:
Because DBM::Deep is a concurrent datastore, every change is flushed to disk immediately and every read goes to disk. This means that DBM::Deep functions at the speed of disk (generally 10-20ms) vs. the speed of RAM (generally 50-70ns), or at least 150-200x slower than the comparable in-memory datastructure in Perl.
There are several techniques you can use to speed up how DBM::Deep functions.
Put it on a ramdisk
The easiest and quickest mechanism to making DBM::Deep run faster is to create a ramdisk and locate the DBM::Deep file there. Doing this as an option may become a feature of DBM::Deep, assuming there is a good ramdisk wrapper on CPAN.
Work at the tightest level possible
It is much faster to assign the level of your db that you are working with to an intermediate variable than to re-look it up every time.
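A minimal sketch of that last point applied to the HoA from the top of this thread (the file name and data are placeholders):

use strict;
use warnings;
use DBM::Deep;

my $db = DBM::Deep->new("prob_states.dbm");            # placeholder file name

my ($State, $IndexLine, $value) = ('S1', 0, '0.42');   # placeholder data

# Slow: every access re-resolves the path from the root, hitting disk each time.
# $db->{$State}[$IndexLine] = $value;

# Faster: look the branch up once, then work through the intermediate variable.
$db->{$State} ||= [];
my $branch = $db->{$State};
$branch->[$IndexLine] = $value;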
Something or the other, a monk since 2009