in reply to Efficient way to handle huge number of records?

  1. can Perl handle a hash of that size?

    Yes. If your system has sufficient memory.

    For a 32-bit Perl, you're limited (in most cases) to 2 GB of address space (regardless of how much memory the machine has installed), which is probably not enough for the size of file you've outlined. A hash holding 11 million pairs with 40-char keys and 80-char values requires ~2.4 GB.

    If you have a 64-bit OS and a 64-bit perl installed, then your memory is limited by RAM (plus swap space, but don't go there). A reasonably modern commodity box with 4GB of RAM and a 64-bit OS & Perl would easily handle your 11 million records, assuming your description means that each description and sequence is limited to ~80 characters each.

    However, the file format you allude to (FASTA) frequently has multiple lines of sequence for each ID, and their sizes can individually be many 1000s or even millions of codons. So whether your machine can handle the dataset you need to work with depends not just upon the number of records, but also upon their sizes.

    As a rough rule of thumb, anticipate 100MB per 1 million records, plus the size of the file on disk in MB.

    Eg. 11,000,000 records with 40-char IDs and 80-char sequences = 11 * 100MB + 11e6 * 120 bytes = 1.1GB + 1.2GB = 2.3GB total memory requirement.
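
    If you want to check that rule of thumb against your own Perl build before committing to the full load, a minimal sketch along these lines will measure a representative sample and extrapolate. It assumes the CPAN module Devel::Size is installed; the sample size of 10,000 is arbitrary.

    use strict; use warnings;
    use Devel::Size qw( total_size );

    # Build a sample hash shaped like the real data: 40-char keys, 80-char values.
    my %sample;
    for ( 1 .. 10_000 ) {
        my $key = join '', map { ('a'..'z')[ rand 26 ] } 1 .. 40;
        my $val = join '', map { (qw[a c g t])[ rand 4 ] } 1 .. 80;
        $sample{ $key } = $val;
    }

    # Measure the sample and scale up to the full 11 million records.
    my $perRecord = total_size( \%sample ) / 10_000;
    printf "~%d bytes/record => ~%.1f GB for 11e6 records\n",
        $perRecord, $perRecord * 11e6 / 2**30;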

  2. In that case, as I can understand, the script would have to create the hash each time, so it would be rather time-consuming, no?

    Not as long as you might think.

    This creates a 1 million key FASTA file and reads it back to create the hash:

    C:\test>perl -E"sub rndStr{ join'', @_[ map{ rand @_ } 1 .. shift ] };; say('>',rndStr( 40, 'a'..'z')),say(rndStr(80,qw[a c g t])) for 1 .. 1e6" >junk.dat

    C:\test>prompt [$t] $p$g

    [ 9:41:54.22] C:\test>perl -E"$h{<>}=<> until eof(); print scalar keys %h" junk.dat
    1000000

    [ 9:41:57.88] C:\test>

    Under 4 seconds to load up the hash. So for 11 million keys, it will take less than a minute to load up the file, assuming you have enough free memory to hold it.

    It will take longer if your sequences are substantially larger.
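
    If your FASTA file does have multi-line sequences, the one-liner above (which assumes exactly one sequence line per header) won't do. A minimal sketch of a loader for that case, reusing the junk.dat name from the session above and assuming the usual '>' header convention, might look like this:

    use strict; use warnings;

    my( %seq, $id );
    open my $fh, '<', 'junk.dat' or die $!;
    while( <$fh> ) {
        chomp;
        if( /^>(.+)/ ) {            # a header line starts a new record
            $id = $1;
        }
        elsif( defined $id ) {
            $seq{ $id } .= $_;      # append continuation lines to the current record
        }
    }
    close $fh;
    print scalar keys %seq, " records loaded\n";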

  3. would it be more efficient to store these records in a Mysql database and then retrieve them based on the NAME for example?

    That really depends a lot upon how many times you are going to use the data.

    Remember that when querying data from a DB, it still has to be loaded from disk. So do the indexes. And for an RDBMS like MySQL, the data then has to be serialised through a socket or pipe.

    For a one-time use, or just a few uses of the dataset, the time taken to set up the DB is pure overhead, which can negate any gains. If you have to download and install the DB first, and then learn how to set it up and use it, a flat file and a hash win hands down.

    If, however, you have the DB, know how to use it, and are going to be accessing the same dataset many times over an extended period, then the equation may swing the other way.

  4. Can the database store that many items without trouble?

    Any DB that couldn't handle that few records would not be worthy of the name. Even MySQL or SQLite should easily handle low billions of records without trouble.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Re^2: Efficient way to handle huge number of records?
by flexvault (Monsignor) on Dec 11, 2011 at 11:36 UTC

    BrowserUk,

    Just to comment on the 32/64-bit question (re. Linux): a 32-bit Linux system can support 64GB of memory, but each process is limited to 4GB (usually 3.7GB). So each Perl application would have approximately 3.7GB to handle script and data.

    In the Windows world, you are absolutely correct.

    Thank you

    "Well done is better than well said." - Benjamin Franklin

      Hm. You're stretching several boundaries beyond their limits there:

      The 32bit memory mapping system supports either 4GB of address space without PAE or 64GB with PAE. But that does not necessarily tell you how much Linux supports with/without PAE.

      Linux also introduces constraints on total physical memory based on interactions with the way it manages kernel virtual memory. That leads to at least four different levels of memory support based on choices made during kernel build.

      The lowest level is 896MB without PAE
      The next level is about 3.25GB (bios limited) without PAE
      The next level is, I think, about 16GB, with PAE

      The highest level, I think, is the full 64GB with PAE plus an ugly kludge in kernel virtual memory (a bad idea, I think; use 64-bit instead for that much RAM).

      Win32 can also do (and I believe was the first to do) Physical Address Extension (PAE). It can also extend the default 2GB user space to 3GB per process. But just like Linux, these limits are extended through a series of kludges that have drawbacks as well as benefits.



        Yes, even 32-bit Windows XP can go to 3GB for a user process, but there can be problems if that feature is enabled.

        This appears to be a job for a DB, if many searches will be performed after the DB is "built" (initialized and indexed).

        From the problem statement, I think that SQLite will do the job just fine. If the time comes that it doesn't, the same SQL will work on another DB.
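
        A hedged illustration of that point with DBI (the file and table names here are invented): moving from SQLite to another engine is mostly a matter of changing the DSN, while the SQL itself carries over.

        use strict; use warnings;
        use DBI;

        # Zero-administration start with SQLite ...
        my $dbh = DBI->connect( 'dbi:SQLite:dbname=seqs.db', '', '',
                                { RaiseError => 1, AutoCommit => 1 } );

        # ... and if SQLite ever stops being enough, only the connect line changes:
        # my $dbh = DBI->connect( 'dbi:mysql:database=seqs', $user, $password,
        #                         { RaiseError => 1, AutoCommit => 1 } );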

        BrowserUk

        I don't have any real experience with 32bit Linux machines with more than 2GB of memory, so your knowledge in this area is gospel. The specs said it could, but I didn't realize it was a hack.

        I do have more experience with 32/64-bit Unix machines, and found that most 64-bit applications (including Perl) required more than twice the real memory that they did on a 32-bit Unix. Now, I haven't checked this in quite a while, so I'll have to revisit that. It could have been the 64-bit version of the compiler or something else. I will be installing an IBM POWER7 p-series p740 with 128GB in January. I will try some benchmarks and let you know. POWER7 is 64-bit only.

        Thank you

        "Well done is better than well said." - Benjamin Franklin

Re^2: Efficient way to handle huge number of records?
by erix (Prior) on Dec 11, 2011 at 14:11 UTC
    Any DB that couldn't handle that few records would not be worthy of the name. Even MySQL or SQLite should easily handle low billions of records without trouble.

    I would be quite interested to see SQLite do this. (I may even try it myself...)

    In the past (last time I tried was, I think, a couple of years ago) SQLite always proved prohibitively slow: loading multimillion-row data was so ridiculously slow (even on fast hardware), that I never bothered with further use.

    I'd love to hear that this has improved - SQLite is nice, when it works. Does anyone have recent datapoints?

    (As far as I am concerned, MySQL and BerkeleyDB, as Oracle products, are not a serious option anymore (I am convinced Oracle will make things worse for non-paying users all the time), but I am interested to know how their performance (or Oracle's itself, for that matter) compares to PostgreSQL.)

      That is not true. SQLite is the fastest DB engine I have ever come across. You just need to increase the buffer size of the read input to, say, 4 MB and do the inserts within a transaction, and you will see that it can import the above values with no problem in under a minute, where MySQL will take much longer. And since it can keep everything in RAM, query time is again going to be much, much faster. So if you, or anyone else, are looking for a fast DB engine without fancy-schmancy features that are rarely used anyway, SQLite is the way to go.

      So if you need a DB engine that is fast and reliable and can deal with lots of data, you will want SQLite.
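
      For what it's worth, a minimal DBI/DBD::SQLite sketch of that recipe (a single transaction around the inserts, plus a larger page cache; the table name and the junk.dat file are just for illustration):

      use strict; use warnings;
      use DBI;

      my $dbh = DBI->connect( 'dbi:SQLite:dbname=seqs.db', '', '',
                              { RaiseError => 1, AutoCommit => 1 } );
      $dbh->do( 'CREATE TABLE IF NOT EXISTS seq ( id TEXT PRIMARY KEY, sequence TEXT )' );
      $dbh->do( 'PRAGMA cache_size = 4000' );   # ~4 MB of page cache with 1 KB pages
      $dbh->do( 'PRAGMA synchronous = OFF' );   # trade durability for bulk-load speed

      $dbh->begin_work;                         # ONE transaction around all the inserts
      my $ins = $dbh->prepare( 'INSERT INTO seq ( id, sequence ) VALUES ( ?, ? )' );

      open my $fh, '<', 'junk.dat' or die $!;
      while( my $id = <$fh> ) {
          my $sequence = <$fh>;
          last unless defined $sequence;
          chomp( $id, $sequence );
          $id =~ s/^>//;                        # strip the FASTA header marker
          $ins->execute( $id, $sequence );
      }
      close $fh;
      $dbh->commit;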

      Now, as far as the initial question goes, you can do something similar to what MySQL does. You could split the file into chunks and index the chunks by line number, so that you know on which line the header of your sequence appears. Once you have done that, you need to hash only those indexes. This will reduce the search space in proportion to the number of fragments you have after chopping up your initial file.
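
      One way to realise that idea without a database is sketched below, with byte offsets via tell/seek standing in for the line numbers described above (the file name and sub name are made up): keep only a small ID-to-offset hash in memory and jump straight to a record when it is wanted.

      use strict; use warnings;

      # Pass 1: index each record's ID against the byte offset of its '>' header.
      my %offset;
      open my $fh, '<', 'junk.dat' or die $!;
      my $pos = tell $fh;
      while( my $line = <$fh> ) {
          $offset{ $1 } = $pos if $line =~ /^>(\S+)/;
          $pos = tell $fh;
      }

      # On demand: seek straight to a record instead of rescanning the file.
      sub fetchSequence {
          my( $id ) = @_;
          return unless exists $offset{ $id };
          seek $fh, $offset{ $id }, 0;          # 0 == SEEK_SET
          my $header = <$fh>;                   # skip the header line
          my $seq = '';
          while( my $line = <$fh> ) {
              last if $line =~ /^>/;            # stop at the next record
              chomp $line;
              $seq .= $line;
          }
          return $seq;
      }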

      In the past (last time I tried was, I think, a couple of years ago) SQLite always proved prohibitively slow: loading multimillion-row data was so ridiculously slow

      I said "handle" not "handle well" :)

      That said, I had SQLite on my old machine and found that .import file table via the sqlite3.exe was substantially faster than doing inserts via SQL, whether from the command line utility or via Perl & DBI.
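
      For reference, the .import route looks something like this (the file and table names are invented, and it assumes one delimited record per line, e.g. pipe-separated):

      C:\test>sqlite3 junk.db
      sqlite> CREATE TABLE seq ( id TEXT PRIMARY KEY, sequence TEXT );
      sqlite> .separator "|"
      sqlite> .import junk.psv seq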

      I wish I could get a 64-bit build for my system.



      erix,

      Sorry to take so long, real work got in the way.

      Here's the updated info. I used BrowserUk's sub to generate the data. On the key part it was 40 bytes long, but on the data part it was 320 bytes, so I 'substr' it to 80 (if 320 is correct, I can run the tests again). I ran the test for 1_000_000 records, since for your purposes it didn't matter. Also, I multiplied the times by 1000 to get the results in milliseconds. I generated the random keys at the beginning, so that they wouldn't be in cache.

      Here it is:

      while ( $cnt < $howmany ) {
          $key  = rndStr( 40, 'a'..'z');
          $data = substr(rndStr(80,qw[a c g t]),0,80));
          if ( ( ( $cnt % 113 ) == 0 )&&( scalar keys %khash < 10 ) ) {
              $khash{$key} = 0;
          }
      .
      .
      .
      for ( 1 .. 4 ) {
          foreach $key ( keys %khash ) {
              $stime = gettimeofday;
              $ret   = $cursor->c_get($key, $data, DB_SET);
              $etime = sprintf("%.6f",(gettimeofday - $stime) * 1_000 );
              print " $key Time: $etime ms\t$hkey\n";
          }
      }

      Running it and Output:

      # time perl Show11M_mod.plx
      cds_enabled
      ## Start: VSZ-10292_KB RSS-4828_KB BLOCK: 512 ( 1000000 )
        Write: 1049.66578292847 952/sec 1000000
        ReadNext: 28.9542100429535 34537/sec Total: 1000000
      ## End: VSZ-10292_KB RSS-6284_KB Diff:0|1456_KB BLOCK: 512
      rijrxyzhfvfhvpktkiedvmnpwdphswhavejjwqvr Time: 0.164032 ms
      evxacpuyerimyidhwfqnvqsjqzrdpgwxzywssakk Time: 0.089884 ms
      qrckdiakaaanjsrnvsswzuebxmtxeaznhpwdqgfn Time: 0.064135 ms
      pxlyvhbaujsfdwzsdjterlqeiothhpdzljizypbi Time: 0.066996 ms
      wfbqhvgjnltboojbctaszbaxlcwibjdjgmwzcusu Time: 0.050068 ms
      ukotkvoceuchbrrdegkixjdegzqclfxbwkdvrnkj Time: 0.043869 ms
      dcrcpnxnuhfrwmysbxnfmbzqhgeblvoyczoqboef Time: 0.052929 ms
      xsgzxvlivfwqirwmpjpdnbtifuvjqmbthmgtnbxh Time: 0.050068 ms
      qntwonibxslleldmlvanodhzlqhweeihlsarfznj Time: 0.053167 ms
      rpflfufduuqvtkydqswvgnyionloswworrdraplt Time: 0.057936 ms
      rijrxyzhfvfhvpktkiedvmnpwdphswhavejjwqvr Time: 0.012875 ms
      evxacpuyerimyidhwfqnvqsjqzrdpgwxzywssakk Time: 0.011921 ms
      qrckdiakaaanjsrnvsswzuebxmtxeaznhpwdqgfn Time: 0.010967 ms
      pxlyvhbaujsfdwzsdjterlqeiothhpdzljizypbi Time: 0.010967 ms
      wfbqhvgjnltboojbctaszbaxlcwibjdjgmwzcusu Time: 0.010967 ms
      ukotkvoceuchbrrdegkixjdegzqclfxbwkdvrnkj Time: 0.011206 ms
      dcrcpnxnuhfrwmysbxnfmbzqhgeblvoyczoqboef Time: 0.010967 ms
      xsgzxvlivfwqirwmpjpdnbtifuvjqmbthmgtnbxh Time: 0.010967 ms
      qntwonibxslleldmlvanodhzlqhweeihlsarfznj Time: 0.012159 ms
      rpflfufduuqvtkydqswvgnyionloswworrdraplt Time: 0.010967 ms
      rijrxyzhfvfhvpktkiedvmnpwdphswhavejjwqvr Time: 0.011921 ms
      evxacpuyerimyidhwfqnvqsjqzrdpgwxzywssakk Time: 0.012159 ms
      qrckdiakaaanjsrnvsswzuebxmtxeaznhpwdqgfn Time: 0.012159 ms
      pxlyvhbaujsfdwzsdjterlqeiothhpdzljizypbi Time: 0.010967 ms
      wfbqhvgjnltboojbctaszbaxlcwibjdjgmwzcusu Time: 0.010014 ms
      ukotkvoceuchbrrdegkixjdegzqclfxbwkdvrnkj Time: 0.010967 ms
      dcrcpnxnuhfrwmysbxnfmbzqhgeblvoyczoqboef Time: 0.010014 ms
      xsgzxvlivfwqirwmpjpdnbtifuvjqmbthmgtnbxh Time: 0.010967 ms
      qntwonibxslleldmlvanodhzlqhweeihlsarfznj Time: 0.010967 ms
      rpflfufduuqvtkydqswvgnyionloswworrdraplt Time: 0.010014 ms
      rijrxyzhfvfhvpktkiedvmnpwdphswhavejjwqvr Time: 0.011921 ms
      evxacpuyerimyidhwfqnvqsjqzrdpgwxzywssakk Time: 0.011921 ms
      qrckdiakaaanjsrnvsswzuebxmtxeaznhpwdqgfn Time: 0.010967 ms
      pxlyvhbaujsfdwzsdjterlqeiothhpdzljizypbi Time: 0.010967 ms
      wfbqhvgjnltboojbctaszbaxlcwibjdjgmwzcusu Time: 0.010967 ms
      ukotkvoceuchbrrdegkixjdegzqclfxbwkdvrnkj Time: 0.010967 ms
      dcrcpnxnuhfrwmysbxnfmbzqhgeblvoyczoqboef Time: 0.010967 ms
      xsgzxvlivfwqirwmpjpdnbtifuvjqmbthmgtnbxh Time: 0.010967 ms
      qntwonibxslleldmlvanodhzlqhweeihlsarfznj Time: 0.010967 ms
      rpflfufduuqvtkydqswvgnyionloswworrdraplt Time: 0.010967 ms

      real 18m17.387s
      user 1m52.459s
      sys  0m34.850s

      Regards...Ed

      "Well done is better than well said." - Benjamin Franklin

        I used BrowserUk's sub to generate the data. On the key part it was 40 bytes long, but on the data part it was 320 bytes, so I 'substr' it to 80 ...
        $data = substr(rndStr(80,qw[a c g t]),0,80));

        Sorry, but you must have typo'd or c&p'd my code incorrectly, because there should be no need to substr the output of rndStr():

        sub rndStr{ join'', @_[ map{ rand @_ } 1 .. shift ] };;

        $x = rndStr( 80, qw[a c g t] );;

        print length $x, ':', $x;;
        80 : actaatcttgcgccgcggcttcatacgagatgaatagtacgaaaacttggatacacctgtatcatagaagggccgctgcg


      In the past (last time I tried was, I think, a couple of years ago) SQLite always proved prohibitively slow: loading multimillion-row data was so ridiculously slow (even on fast hardware), that I never bothered with further use.

      I showed the framework of some code at Re^3: Efficient way to handle huge number of records? which is an abbreviated version of some code that I'm currently working on. One table has a million records of 50 fields. So I ran a couple of tests.

      First test was with all the speed-up stuff turned off:

      Starting: Fri Dec 16 13:56:56 2011
      Creating new Database!! - HD table records inserted: 1,117,526
      Ending: Sun Dec 18 02:15:30 2011
      Now I figure that qualifies as "ridiculously slow!". I actually had to run it twice because I got one of those "Windows Automatic Reboot was required" things! Bummer when that happens after one day of processing!

      Using the optimizations (and by FAR and away the biggest effect is to do a single transaction!) results in:

      Starting: Sun Dec 18 15:26:53 2011
      Creating new Database!! - HD table records inserted: 1,117,526
      Ending: Sun Dec 18 15:29:44 2011
      Or about 3 minutes instead of 2 days! A lot better! This is fast enough for my needs. Using the bulk import utility would probably be faster, but I haven't tested that because 3 minutes doesn't bother my application.

      I have another app that builds a 500K record table, and it builds it from 1,000 input files. That takes about half the time, or about 90 seconds. It's not worth my programming effort to emit an intermediate file in whatever format the bulk import utility needs - I just put the data into the DB right away. A reasonable programming tradeoff. Mileage varies.

      It should be noted that my machine is an older one: a hyper-threaded one (before the multi-core days), the Prescott stepping - the last one with PGA (pin grid array) - and my disks are only 5K rpm (not 7K+). A more modern machine can run a single process at least 4x this fast, or about 45 seconds instead of 3 minutes (I've benchmarked my machine vs a friend's on similar tasks before).

      The time scales linearly, so 11M records would take 10x as long. Is that "ridiculously slow"? I don't know. I guess that depends upon the application.

      I do have a MySQL server running on my machine and in the past I've done some benchmarking vs SQLite. For a complicated query, MySQL is faster, but for my current projects, SQLite "wins" due to admin simplicity (none required!).