Can Perl handle a hash of that size?
Yes, if your system has sufficient memory.
For a 32-bit Perl, you're limited (in most cases) to 2 GB of RAM regardless of how much memory the machine has installed, which is probably not enough for the size of file you've outlined. A hash holding 11 million pairs with 40-char keys and 80-char values requires ~2.4 GB.
If you have a 64-bit OS and a 64-bit Perl installed, then your memory is limited only by RAM (plus swap space, but don't go there). A reasonably modern commodity box with 4 GB of RAM and a 64-bit OS & Perl would easily handle your 11 million records, assuming your description means that each description and sequence is limited to ~80 characters.
However, the file format you allude to (FASTA) frequently has multiple lines of sequence for each ID, and individual sequences can run to many thousands or even millions of codons. So whether your machine can handle the dataset you need to work with depends not just upon the number of records, but also upon their sizes.
As a rough rule of thumb, anticipate 100 MB per 1 million records, plus the size of the file on disk in MB.
E.g. 11,000,000 records with 40-char IDs and 80-char sequences: 11 * 100 MB + 11e6 * 120 bytes ≈ 1.1 GB + 1.3 GB ≈ 2.4 GB total memory requirement.
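If you'd rather measure than estimate, a small sketch along these lines can size a sample hash and scale it up. It assumes the CPAN module Devel::Size is available (it is not core), and the sample size and key/value shapes are just illustrative:

    use strict;
    use warnings;
    use Devel::Size qw( total_size );

    # Build a 100,000-record sample with 40-char keys and 80-char values,
    # then scale the measured size up to 11 million records.
    my %h;
    for my $i ( 1 .. 100_000 ) {
        my $id = sprintf '%040d', $i;      # 40-char key
        $h{$id} = 'acgt' x 20;             # 80-char value
    }

    printf "~%.1f GB estimated for 11e6 records\n",
        total_size( \%h ) / 100_000 * 11_000_000 / 2**30;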
In that case, as I understand it, the script would have to re-create the hash each time it runs, so it would be rather time-consuming, no?
Not as long as you might think.
This creates a 1-million-key FASTA file and then reads it back to build the hash:
C:\test>perl -E"sub rndStr{ join'', @_[ map{ rand @_ } 1 .. shift ] }; +; say('>',rndStr( 40, 'a'..'z')),say(rndStr(80,qw[a c g t])) for 1 .. + 1e6" >junk.dat C:\test>prompt [$t] $p$g [ 9:41:54.22] C:\test>perl -E"$h{<>}=<> until eof(); print scalar keys + %h" junk.dat 1000000 [ 9:41:57.88] C:\test>
Under 4 seconds to load up the hash. So for 11 million keys, it will take less than a minute to load the file, assuming you have enough free memory to hold it.
It will take longer if your sequences are substantially larger.
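For readability, here is the hash-loading one-liner expanded into a sketch of a script that also chomps the newlines and strips the leading '>' so the hash is keyed by the bare ID (the one-liner above keeps both, which is fine for a quick count). It assumes the same one-ID-line/one-sequence-line layout as the generated junk.dat, not general multi-line FASTA, and the script name is illustrative:

    use strict;
    use warnings;

    # Usage: perl load_hash.pl junk.dat
    my %h;
    while ( my $id = <> ) {
        my $seq = <>;
        last unless defined $seq;   # ignore a trailing unpaired line
        chomp( $id, $seq );
        $id =~ s/^>//;              # drop the FASTA header marker
        $h{$id} = $seq;
    }
    print scalar keys %h, "\n";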
Would it be more efficient to store these records in a MySQL database and then retrieve them based on the NAME, for example?
That depends a lot upon how many times you are going to use the data.
Remember that when querying data from a DB, it still has to be loaded from disk, and so do the indexes. And with an RDBMS like MySQL, the data also has to be serialised through a socket or pipe.
For a one-time use, or just a few uses, of the dataset, the time taken to set up the DB is pure overhead that can negate any gains. If you have to download and install the DB first, and then learn how to set it up and use it, a flat file and a hash win hands down.
If, however, you already have the DB, know how to use it, and are going to be accessing the same dataset many times over an extended period, then the equation may swing the other way.
Can the database store that many items without trouble?
Any DB that couldn't handle that few records would not be worthy of the name. Even MySQL or SQLite should easily handle low billions of records without trouble.
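If you do go the DB route, a minimal sketch of the load-once/query-many pattern might look like the following, using DBI with DBD::SQLite (both CPAN modules, assumed installed; the file name, table name and the 'some_id' lookup key are placeholders):

    use strict;
    use warnings;
    use DBI;

    # Load once into an SQLite file, then query by ID on later runs.
    my $dbh = DBI->connect( 'dbi:SQLite:dbname=seqs.db', '', '',
        { RaiseError => 1, AutoCommit => 0 } );

    $dbh->do( 'CREATE TABLE IF NOT EXISTS seq ( id TEXT PRIMARY KEY, sequence TEXT )' );

    my $ins = $dbh->prepare( 'INSERT OR REPLACE INTO seq ( id, sequence ) VALUES ( ?, ? )' );
    open my $fh, '<', 'junk.dat' or die $!;
    while ( my $id = <$fh> ) {
        my $seq = <$fh>;
        last unless defined $seq;
        chomp( $id, $seq );
        $id =~ s/^>//;
        $ins->execute( $id, $seq );
    }
    $dbh->commit;

    # Later lookups by name avoid rebuilding the hash every run.
    my ( $seq ) = $dbh->selectrow_array(
        'SELECT sequence FROM seq WHERE id = ?', undef, 'some_id' );
    my $found = $seq // 'not found';
    print "$found\n";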
In reply to Re: Efficient way to handle huge number of records?
by BrowserUk
in thread Efficient way to handle huge number of records?
by Anonymous Monk