in reply to Efficient way to handle huge number of records?

Hi there,

If the text file is the same each time, a good option that is simpler than SQL is the Storable module.

This way, you can load the data into a hash once and then store it as a serialized file, which can be retrieved fairly quickly on later runs. I had a similar bioinformatics problem that was solved this way; a rough sketch of the idea is below.
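Roughly, the pattern looks like this (an untested sketch; the file names and the parsing step are placeholders for your own):

use strict;
use warnings;
use Storable;

# Hypothetical file names -- substitute your own.
my $textfile  = 'data.txt';
my $cachefile = 'data.stored';

my $data;
if (-e $cachefile && -M $cachefile < -M $textfile) {
    # The serialized copy exists and is newer than the text file:
    # skip the parsing and load it directly.
    $data = retrieve($cachefile);
}
else {
    # First run (or the text file changed): parse once, then cache.
    $data = parse_text_file($textfile);   # placeholder for your parsing routine
    store $data, $cachefile;
}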
Good Luck!
MrGuy

Re^2: Efficient way to handle huge number of records?
by Anonymous Monk on Dec 11, 2011 at 12:58 UTC
    Yes, that might actually be the solution to my problem!
    I always have the same kind of database file, which looks like this:
    >protein1
    ASKRKGRKGTTRHTRHTRHTRHTHRHTR
    >protein2
    REHGTRYUYTJUYJYTJUJUJUJYJUYJUY
    >protein3
    EWFDEWFEFREFRE

    So I'll create a "big" hash with protein1, protein2, protein3, etc. as keys and ASKRKGRKGTTRHTRHTRHTRHTHRHTR, REHGTRYUYTJUYJYTJUJUJUJYJUYJUY, EWFDEWFEFREFRE, etc. as values, perhaps something like the sketch below.
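    An untested sketch of how I might build that hash (the file name is just an example):

    use strict;
    use warnings;

    my %table;
    my $current;

    # Read the FASTA-style file: ">name" lines start a new entry,
    # everything else is sequence data for the current entry.
    open my $fh, '<', 'proteins.fasta' or die "Can't open file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>(\S+)/) {
            $current = $1;                # e.g. "protein1"
        }
        elsif (defined $current) {
            $table{$current} .= $line;    # append sequence to the current key
        }
    }
    close $fh;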
    I am a bit confused about the usage of this module, though I browsed the respective manual page a bit... Once you create the hash, how do you call it? Is it retrieved each time the script is called?
    If you can, please supply a brief example of usage!
    Thanks a bunch!
      Creating an SQLite DB is not that hard. I haven't tested this code, but this is at least the general idea.
      #!/usr/bin/perl -w
      use strict;
      use DBI;

      my $dbfile = "./DNA.sqlite";

      print "Creating new DNA Database\n";
      if (-e $dbfile) { unlink $dbfile or die "Delete of $dbfile failed! $!\n"; }

      my $dbh = DBI->connect("dbi:SQLite:dbname=$dbfile", "", "", { RaiseError => 1 })
          or die "Couldn't connect to database: " . DBI->errstr;

      $dbh->do("CREATE TABLE dna (
          id       integer PRIMARY KEY AUTOINCREMENT,
          protein  varchar(10)   DEFAULT '',
          sequence varchar(1000) DEFAULT ''
      );");

      $dbh->do('PRAGMA synchronous = 0');     # Non transaction safe!!!
      $dbh->do('PRAGMA cache_size = 200000'); # 200 MB dynamic cache increase
                                              # makes index creation faster
      $dbh->do("BEGIN");
      import_data();
      $dbh->do("COMMIT");

      $dbh->do("BEGIN");
      $dbh->do("CREATE INDEX iprotein_idx ON dna (protein)");
      $dbh->do("COMMIT");

      sub import_data {
          my $add = $dbh->prepare("INSERT INTO dna (protein, sequence) VALUES (?,?)");
          # ...your loop to read the data goes here:
          # foreach protein and sequence...
          # {
          #     $add->execute($protein, $sequence);
          # }
      }
      Update:
      Basically, when creating the DB you want to turn all the ACID stuff off.
      That means: don't wait for writes to complete, don't wrap each individual "add" in its own transaction, and raise the cache size from the default to at least 200 MB for more efficient index creation. Run as "lean and mean" as you can.
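      Once the table and index exist, a lookup by protein name is cheap. An untested example, assuming the schema above:

      my $sth = $dbh->prepare("SELECT sequence FROM dna WHERE protein = ?");
      $sth->execute('protein1');                 # served by the iprotein_idx index
      my ($sequence) = $sth->fetchrow_array;
      print "$sequence\n" if defined $sequence;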

        I'm not familiar with your PRAGMA settings, but in general this is a possible solution.

        The only thing I'd comment on is that DBI provides $dbh->begin_work and $dbh->commit, which you should use instead of $dbh->do( "BEGIN" ) and $dbh->do( "COMMIT" ).
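        For example, this gives the same effect as the BEGIN/COMMIT pairs above (untested):

        $dbh->begin_work;   # instead of $dbh->do("BEGIN")
        import_data();
        $dbh->commit;       # instead of $dbh->do("COMMIT")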

      Hi, it is fairly simple. You store the hash in a file and then retrieve it when you want to use it (see the example below). It is retrieved as a hashref, but you can dereference it back into a regular hash with a '%' prefix.
      use Storable;
      store \%table, 'file';            # serialize %table to disk
      my $hashref = retrieve('file');   # read it back as a hash reference
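      For instance, a later run could skip the parsing entirely (using the keys from your file above):

      use Storable;
      my %table = %{ retrieve('file') };   # dereference back into a plain hash
      print $table{protein1}, "\n";        # look up one sequence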
      Good luck
      MrGuy