in reply to Efficient way to handle huge number of records?

Hi there,

If the text file is the same each time, a good option that is simpler than SQL is the Storable module.

This way, you can load the data into a hash once and then store it as a serialized file, which can be retrieved fairly quickly on later runs. I had a similar bioinformatics problem that was solved this way; a rough sketch of the idea is below.
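Roughly, the pattern looks like this (an untested sketch; the file names and the parsing step are placeholders for your own):

use strict;
use warnings;
use Storable;

# Hypothetical file names -- substitute your own.
my $textfile  = 'data.txt';
my $cachefile = 'data.stored';

my $data;
if (-e $cachefile && -M $cachefile < -M $textfile) {
    # The serialized copy exists and is newer than the text file:
    # skip the parsing and load it directly.
    $data = retrieve($cachefile);
}
else {
    # First run (or the text file changed): parse once, then cache.
    $data = parse_text_file($textfile);   # placeholder for your parsing routine
    store $data, $cachefile;
}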
Good Luck!
MrGuy

Re^2: Efficient way to handle huge number of records?
by Anonymous Monk on Dec 11, 2011 at 12:58 UTC
    Yes, that might actually be the solution to my problem!
    I always have the same kind of database file, which looks like this:
    >protein1
    ASKRKGRKGTTRHTRHTRHTRHTHRHTR
    >protein2
    REHGTRYUYTJUYJYTJUJUJUJYJUYJUY
    >protein3
    EWFDEWFEFREFRE

    So I'll create a "big" hash with protein1, protein2, protein3, etc. as keys and ASKRKGRKGTTRHTRHTRHTRHTHRHTR, REHGTRYUYTJUYJYTJUJUJUJYJUYJUY, EWFDEWFEFREFRE, etc. as values, perhaps something like the sketch below.
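    An untested sketch of how I might build that hash (the file name is just an example):

    use strict;
    use warnings;

    my %table;
    my $current;

    # Read the FASTA-style file: ">name" lines start a new entry,
    # everything else is sequence data for the current entry.
    open my $fh, '<', 'proteins.fasta' or die "Can't open file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>(\S+)/) {
            $current = $1;                # e.g. "protein1"
        }
        elsif (defined $current) {
            $table{$current} .= $line;    # append sequence to the current key
        }
    }
    close $fh;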
    I am a bit confused about the usage of this module, though I browsed the respective manual page a bit... Once you create the hash, how do you call it? Is it retrieved each time the script is called?
    If you can, please supply a brief example of usage!
    Thanks a bunch!
      Creating an SQLite DB is not that hard. I haven't tested this code, but this is at least the general idea.
      #!/usr/bin/perl -w
      use strict;
      use DBI;

      my $dbfile = "./DNA.sqlite";

      print "Creating new DNA Database\n";
      if (-e $dbfile) { unlink $dbfile or die "Delete of $dbfile failed! $!\n"; }

      my $dbh = DBI->connect("dbi:SQLite:dbname=$dbfile", "", "", { RaiseError => 1 })
          or die "Couldn't connect to database: " . DBI->errstr;

      $dbh->do("CREATE TABLE dna (
          id       integer PRIMARY KEY AUTOINCREMENT,
          protein  varchar(10)   DEFAULT '',
          sequence varchar(1000) DEFAULT ''
      );");

      $dbh->do('PRAGMA synchronous = 0');     # Non transaction safe!!!
      $dbh->do('PRAGMA cache_size = 200000'); # 200 MB dynamic cache increase
                                              # makes index creation faster
      $dbh->do("BEGIN");
      import_data();
      $dbh->do("COMMIT");

      $dbh->do("BEGIN");
      $dbh->do("CREATE INDEX iprotein_idx ON dna (protein)");
      $dbh->do("COMMIT");

      sub import_data {
          my $add = $dbh->prepare("INSERT INTO dna (protein, sequence) VALUES (?,?)");
          # ...your loop to read the data goes here:
          # foreach protein and sequence...
          # {
          #     $add->execute($protein, $sequence);
          # }
      }
      Update:
      Basically, when creating the DB you want to turn all the ACID stuff off.
      That means: don't wait for writes to complete, don't wrap each individual "add" in its own transaction, and raise the cache size from the default to at least 200 MB for more efficient index creation. Run as "lean and mean" as you can.
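      Once the table and index exist, a lookup by protein name is cheap. An untested example, assuming the schema above:

      my $sth = $dbh->prepare("SELECT sequence FROM dna WHERE protein = ?");
      $sth->execute('protein1');                 # served by the iprotein_idx index
      my ($sequence) = $sth->fetchrow_array;
      print "$sequence\n" if defined $sequence;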

        I'm not familiar with your PRAGMA settings, but in general this is a possible solution.

        The only thing I'd comment on is that DBI provides $dbh->begin_work and $dbh->commit, which you should use instead of $dbh->do( "BEGIN" ) and $dbh->do( "COMMIT" ).
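        For example, this gives the same effect as the BEGIN/COMMIT pairs above (untested):

        $dbh->begin_work;   # instead of $dbh->do("BEGIN")
        import_data();
        $dbh->commit;       # instead of $dbh->do("COMMIT")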

      Hi, it is fairly simple. You store the hash in a file and then retrieve it when you want to use it (see the example below). It is retrieved as a hashref, but you can dereference it back into a regular hash with a '%' prefix.
      use Storable;
      store \%table, 'file';            # serialize %table to disk
      my $hashref = retrieve('file');   # read it back as a hash reference
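      For instance, a later run could skip the parsing entirely (using the keys from your file above):

      use Storable;
      my %table = %{ retrieve('file') };   # dereference back into a plain hash
      print $table{protein1}, "\n";        # look up one sequence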
      Good luck
      MrGuy