Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have code which compares multiple data files by storing them in a hash. It works well with small amounts of data, but with large data the program gets killed. I think that if we can store the files somewhere else, compare them and get the results, it could save much of the memory and work with any amount of data. Please suggest how to do this, so that the script can run and give results even with a very large amount of data.


Replies are listed 'Best First'.
Re: storing hash in temporary files to save memory usage
by haukex (Archbishop) on Sep 20, 2016 at 09:37 UTC

    Hi Anonymous,

    Showing us a short, working piece of code that is representative of your program, along with some example data, would help us give specific advice. See also Short, Self Contained, Correct Example and How do I post a question effectively? Also, how much is the low amount and how much is the large amount of data?

    In general, there are certainly other approaches to the issue. More efficient algorithms or more efficient data storage in memory, hashes/arrays tied to files on disk, databases, etc. But which is best depends on your specific circumstances.

    Regards,
    -- Hauke D

Re: storing hash in temporary files to save memory usage
by BrowserUk (Patriarch) on Sep 20, 2016 at 13:28 UTC

    There are many file-based hashing modules, e.g. BerkeleyDB::Hash, DB_File, DBM::Deep etc. But each has its own strengths and weaknesses depending upon how you need to use them.
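
    To give a flavour of how little of the surrounding code changes, here is a minimal sketch using DB_File (purely illustrative; the filename, flags and key are placeholders, and this is not a recommendation of DB_File over the others):

    use strict;
    use warnings;
    use DB_File;                            # disk-backed hash (Berkeley DB 1.x API)

    # Only the tie line differs from an ordinary in-memory hash.
    tie my %count, 'DB_File', 'counts.db', O_RDWR|O_CREAT, 0644, $DB_BTREE
        or die "Cannot tie counts.db: $!";

    $count{AGATCNGAA}  = 2;                 # writes go to disk, not RAM
    $count{AGATCNGAA} += 3;                 # read-modify-write against the file
    print "$count{AGATCNGAA}\n";            # prints 5

    untie %count;                           # flush and close the file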

    Which, if any of them, is right for your application depends very much on:

    1. the nature of your data:

      Are these simple key/value pairs or can the values themselves also be hashes or arrays?

    2. the nature and pattern of usage:

      Is this a one-off thing? Ie. do you build the filed hashes once, run your processing, and then discard those files?

      Or do they get reused many times?

    3. Will the filed hashes be accessed from a single process?

      Or multiple concurrent processes?

    4. Are the processes accessing these hashes long running processes that open the files once and then do lots of processing?

      Or are they short lived processes (eg. webserver sessions) that open the file, access one or two keys and then close them again?

    If you give us a clearer picture of the nature of the data and the processes accessing it, we can probably give you far better suggestions for which modules or methods are most likely to fit your needs.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: storing hash in temporary files to save memory usage
by Your Mother (Archbishop) on Sep 20, 2016 at 12:55 UTC

    Couple good answers already. I recommend searching CPAN for caching, especially dedicated caches or file caches. I like CHI as a unified front end to many caches. Caches are great, and some are robust (some aren’t designed to be), but they will all be slower than a plain hash (some much slower).
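
    For instance, a minimal sketch of the CHI front end with its File driver (the driver choice and root_dir are just placeholders; the calling code stays the same whichever backend you pick):

    use strict;
    use warnings;
    use CHI;

    # File-based cache; root_dir is a placeholder path.
    my $cache = CHI->new( driver => 'File', root_dir => '/tmp/compare-cache' );

    $cache->set( 'AGATCNGAA', 2 );          # written through to disk
    my $count = $cache->get('AGATCNGAA');   # undef if missing or expired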

Re: storing hash in temporary files to save memory usage
by hippo (Archbishop) on Sep 20, 2016 at 09:44 UTC

    Have you tried tieing the hashes to files (eg. with Tie::Hash)? If so, please elaborate on precisely how this failed to meet your needs.

Re: storing hash in temporary files to save memory usage
by Laurent_R (Canon) on Sep 20, 2016 at 22:25 UTC
    A couple of comments.

    I have yet to see a case where a hash too large for memory can be managed within a reasonable time frame with tied hash solutions. Tied hashes are just too slow for very large data. If it gets that big, then a database is probably a better solution.

    Is your size problem due to having just a few very large files, or is it because you have a very large number of relatively large files? The possible solutions would be very different.

    I am regularly comparing truly huge files (hundreds of millions of records), that cannot fit into memory. In my experience (having tried many many things), the fastest way is to use the system sort utility to sort them according to the comparison key and then to read two files sequentially in parallel. The comparison routine is slightly tricky (to make sure you stay in sync), but nothing insurmountable.
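
    The skeleton of that parallel read looks roughly like this (a sketch only, not my production code; it assumes each line starts with a tab-separated key, unique within each file, and that both files were pre-sorted with the system sort):

    use strict;
    use warnings;

    # Assumes both files are already sorted on the key, e.g. with
    #   sort -k1,1 data1.txt > a.sorted
    # and that each line is "key<TAB>rest" with unique keys per file.
    open my $fa, '<', 'a.sorted' or die "a.sorted: $!";
    open my $fb, '<', 'b.sorted' or die "b.sorted: $!";

    my $la = <$fa>;
    my $lb = <$fb>;

    while ( defined $la && defined $lb ) {
        my ($ka) = split /\t/, $la;
        my ($kb) = split /\t/, $lb;
        if    ( $ka lt $kb ) { $la = <$fa> }    # key present only in file A
        elsif ( $ka gt $kb ) { $lb = <$fb> }    # key present only in file B
        else {                                  # key in both: compare the records
            print "in both files: $ka\n";
            $la = <$fa>;
            $lb = <$fb>;
        }
    }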

      I agree. Which is why I created Tie::Hash::DBD to make that transition easy. Works fantastic with DBD::SQLite.
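
      The tie itself is essentially a one-liner (a sketch; the database file name is a placeholder):

      use strict;
      use warnings;
      use Tie::Hash::DBD;                   # needs DBI and DBD::SQLite installed

      # Keys and values live in an SQLite file instead of RAM.
      tie my %hash, 'Tie::Hash::DBD', 'dbi:SQLite:dbname=hash.sqlite';

      $hash{AGATCNGAA} = 2;                 # stored in the database
      print "$hash{AGATCNGAA}\n";           # fetched from the database

      untie %hash;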

      Here is a speed comparison table for Tie methods. Higher is better:

      Size  op     Native      GDBM      NDBM      ODBM      SDBM   DB_File  CDB_File  Berkeley    SQLite        Pg     mysql       CSV
      ----  --  ---------  --------  --------  --------  --------  --------  --------  --------  --------  --------  --------  --------
        20  rd  1428571.4  119047.6   84388.2   73800.7   91743.1   46403.7  512820.5   25031.3   17226.5    3287.3    1573.4    1003.0
        20  wr  1428571.4   80971.7   59171.6   50377.8   42553.2   45766.6  425531.9   24600.2   12586.5    1481.4     361.0    1079.7
       200  rd  2739726.0  142959.3  110132.2   93985.0  105652.4   57954.2  947867.3   32631.8   22634.7    7008.2    1574.5     203.4
       200  wr  2409638.6  151285.9  100200.4   86355.8   51533.1   75358.0  943396.2   41999.2   15764.2    2941.3     961.0     375.2
       600  rd  2597402.6  145384.1  118506.8  106345.3  122299.2   66430.5  823045.3   40952.8   30453.8   11594.7     646.6         -
       600  wr  2409638.6  167597.8  117947.7  101078.2   53922.9   74915.7  840336.1   46285.6   19183.4    6221.4     692.4         -

      And here is a wider table with more tests (Redis) and Oracle over a network connection, to show the difference.


      Enjoy, Have FUN! H.Merijn

        I am sorry, but I am unable to understand how to use it with my script.

      Update 1: Added B+ tree results for DB_File, BerkeleyDB, and TokyoCabinet.
      Update 2: Added results for in-memory consumption and hash databases.
      Update 3: See Kyoto Tycoon key-value store (and the underlying Kyoto Cabinet library).
      Update 4: Resolved issue with Kyoto Cabinet (tree) failing random fetch.

      Regarding DBM files, I'm not aware of anything faster than Kyoto Cabinet, the successor of Tokyo Cabinet. Sorting isn't necessary when storing into a B+ tree database: using the .kct extension organizes the records in a B+ tree database, and once the key-value pairs are stored, sequential access is much faster than random access.

      Testing was done on a MacBook Pro (late 2013, i7 Haswell @ 2.6 GHz) using Perl 5.26.0. The CPU TurboBoost may run as high as 3.8 GHz on one core. Unfortunately, I do not have anything slower to run on. The takeaway is that Kyoto Cabinet is the fastest and smallest of the bunch.

      use strict;
      use warnings;

      use BerkeleyDB;
      use DB_File;
      use TokyoCabinet;
      use KyotoCabinet;

      unlink qw( /tmp/file.db /tmp/file.tch /tmp/file.kch );
      unlink qw( /tmp/file.tct /tmp/file.kct );

      # --

      # my $ob = tie my %hash, 'BerkeleyDB::Hash',
      #    -Filename => '/tmp/file.db', -Flags => DB_CREATE
      #       or die "open error: $!";
      #
      # my $ob = tie my %hash, 'BerkeleyDB::Btree',
      #    -Filename => '/tmp/file.db', -Flags => DB_CREATE
      #       or die "open error: $!";
      #
      # my $ob = tie my %hash, 'DB_File',
      #    '/tmp/file.db', O_RDWR|O_CREAT, 0644, $DB_HASH
      #       or die "open error: $!";
      #
      # my $ob = tie my %hash, 'DB_File',
      #    '/tmp/file.db', O_RDWR|O_CREAT, 0644, $DB_BTREE
      #       or die "open error: $!";
      #
      # my $ob = tie my %hash, 'TokyoCabinet::HDB', '/tmp/file.tch',
      #    TokyoCabinet::HDB::OWRITER | TokyoCabinet::HDB::OCREAT
      #       or die "open error: $!";
      #
      # my $ob = tie my %hash, 'TokyoCabinet::BDB', '/tmp/file.tcb',
      #    TokyoCabinet::BDB::OWRITER | TokyoCabinet::BDB::OCREAT
      #       or die "open error: $!";
      #
      # my $ob = tie my %hash, 'KyotoCabinet::DB', '/tmp/file.kch',
      #    KyotoCabinet::DB::OWRITER | KyotoCabinet::DB::OCREATE
      #       or die "open error: $!";

      my $ob = tie my %hash, 'KyotoCabinet::DB', '/tmp/file.kct',
         KyotoCabinet::DB::OWRITER | KyotoCabinet::DB::OCREATE
            or die "open error: $!";

      # --

      # Tie interface : 23.875 seconds, 793 MiB - BerkeleyDB::Btree
      #                 19.812 seconds, 793 MiB - DB_File $DB_BTREE
      #                 19.208 seconds, 353 MiB - TokyoCabinet *.tcb
      #                 17.232 seconds, 306 MiB - KyotoCabinet *.kct
      #
      # for ( 1 .. 10e6 ) {
      #     $hash{$_} = "$_ some string...";
      # }
      #
      # OO interface  : 82.573 seconds, 639 MiB - BerkeleyDB::Hash
      #                 73.383 seconds, 639 MiB - DB_File $DB_HASH
      #                 87.695 seconds, 458 MiB - TokyoCabinet *.tch
      #                 38.312 seconds, 464 MiB - KyotoCabinet *.kch
      #
      #                 19.899 seconds, 793 MiB - BerkeleyDB::Btree
      #                 14.340 seconds, 793 MiB - DB_File $DB_BTREE
      #                 14.763 seconds, 353 MiB - TokyoCabinet *.tcb
      #                 10.970 seconds, 306 MiB - KyotoCabinet *.kct

      for ( 1 .. 10e6 ) {
          $ob->STORE($_ => "$_ some string...");
      }

      For Mac users, Kyoto Cabinet requires patching 3 files, found here. The MacPorts file is found here. Tokyo Cabinet builds fine without manual intervention (not shown below). Finally, build the Perl driver; the documentation can be found under the doc dir.

      $ tar xf $HOME/Downloads/kyotocabinet-1.2.76.tar.gz
      $ cd kyotocabinet-1.2.76
      $ patch -p0 < $HOME/Downloads/patch-kccommon.h.diff
      $ patch -p0 < $HOME/Downloads/patch-configure.diff
      $ patch -p0 < $HOME/Downloads/patch-kcthread.cc
      $ ./configure --disable-lzo --disable-lzma
      $ make -j2
      $ sudo make install
      $ tar xf $HOME/Downloads/kyotocabinet-perl-1.20.tar.gz
      $ cd kyotocabinet-perl-1.20
      $ perl Makefile.PL
      $ sudo make install
      $ cd doc
      $ open index.html

      One may run entirely from memory: simply replace the filename with '*' for a cache hash database or '%' for a cache tree database. The memory footprint is less than half that of Perl's native hash; Kyoto Cabinet's in-memory footprint is about 4.2x smaller than a plain hash. Storing key-value pairs doesn't take longer, either.

      my $ob1 = tie my %h1, 'TokyoCabinet::ADB', '*';   # in-memory hash
      my $ob2 = tie my %h2, 'TokyoCabinet::ADB', '+';   # in-memory tree
      my $ob3 = tie my %h3, 'KyotoCabinet::DB',  '*';   # in-memory hash
      my $ob4 = tie my %h4, 'KyotoCabinet::DB',  '%';   # in-memory tree
      use strict;
      use warnings;

      use Time::HiRes 'time';
      use TokyoCabinet;
      use KyotoCabinet;

      my %hash;
      my $ob2 = tie my %h2, 'TokyoCabinet::ADB', '+';   # in-memory tree
      my $ob4 = tie my %h4, 'KyotoCabinet::DB',  '%';   # in-memory tree

      my $start = time;

      # Plain hash    1911 MiB, 10.182 seconds
      # for ( 1 .. 10e6 ) {
      #     $hash{$_} = "$_ some string...";
      # }

      # Tokyo Cabinet  627 MiB, 10.165 seconds
      # for ( 1 .. 10e6 ) {
      #     $ob2->STORE($_ => "$_ some string...");
      # }

      # Kyoto Cabinet  453 MiB, 10.062 seconds
      for ( 1 .. 10e6 ) {
          $ob4->STORE($_ => "$_ some string...");
      }

      printf {*STDERR} "capture memory consumption in top: %0.03f\n", time - $start;

      1 for ( 1 .. 2e8 );

      Initially, randomly accessing an in-memory B+ tree database with Kyoto Cabinet took so long that I stopped the script after 40 seconds and compared against the in-memory hash database instead. Appending the pccap=256m option resolved the issue; it increases the default page cache memory to 256 MiB.

      use strict;
      use warnings;

      use List::Util 'shuffle';
      use Time::HiRes 'time';
      use TokyoCabinet;
      use KyotoCabinet;

      srand 0;

      my %hash;
      my $ob2 = tie my %h2, 'TokyoCabinet::ADB', '+';             # Tree
      my $ob4 = tie my %h4, 'KyotoCabinet::DB',  '%#pccap=256m';  # Tree

      my $size = 5e6;
      my $start;
      my @keys = shuffle 1 .. $size;

      # plain hash    4.342 seconds
      # for ( 1 .. $size ) {
      #     $hash{$_} = "$_ some string...";
      # }
      # $start = time;
      # for ( @keys ) {
      #     my $v = $hash{$_};
      # }

      # TokyoCabinet 11.572 seconds '+' tree
      # TokyoCabinet  8.936 seconds '*' hash
      # for ( 1 .. $size ) {
      #     $ob2->STORE($_ => "$_ some string...");
      # }
      # $start = time;
      # for ( @keys ) {
      #     my $v = $ob2->FETCH($_);
      # }

      # KyotoCabinet 11.991 seconds '%' tree
      # KyotoCabinet  6.087 seconds '*' hash
      for ( 1 .. $size ) {
          $ob4->STORE($_ => "$_ some string...");
      }
      $start = time;
      for ( @keys ) {
          my $v = $ob4->FETCH($_);
      }

      printf "duration: %0.03f seconds\n", time - $start;

      See this page for specific tuning parameters: particularly #pccap=256m for tree databases and the #capsiz option for in-memory hash databases, and likewise the #bnum option for tuning the number of buckets (it should be set to about twice the number of expected keys). Options are appended to the filename argument.

      "/tmp/file.kch#bnum=5000000" # hash "/tmp/file.kct#pccap=256m" # tree "*#bnum=5000000#capsiz=1024m" # in-memory hash "%#pccap=256m" # in-memory tree

      What I've learned during this experience is that one must try both hash and B+ tree databases. Depending on the application, it may favor one over the other.

      Regards, Mario

        Thanks a lot, Mario, for this very interesting information. I'll give it a try.

      I have the following code to compare the sequences in multiple large files. The code below works fine with some files, but I want it to run for any number of files, however large they may be. I tried executing it with more than 10 GB of data, but the program gets killed. An example of a file is shown below. The count is summed only if the second line of each set matches in both files. I want to output the sum for those sets whose second line matches in all the files.

      data1.txt

      @NS500278
      AGATCNGAA
      +
      =CCGGGCGG    1

      @NS500278
      TACAGNGAG
      +
      CCCGGGGGG    2

      @NS500278
      CATTGNACC
      +
      CCCGGGGGG    3

      data2.txt

      @NS500278
      AGATCNGAA
      +
      =CCGGGCGG    1

      @NS500278
      CATTGNACC
      +
      CCCG#GGG#    2

      @NS500278
      TACAGNGAG
      +
      CC=GGG#GG    2

      output:

      @NS500278
      AGATCNGAA
      +
      =GGGGGCCG    1:data1.txt.out    1:data2.txt.out    count:2

      @NS500278
      CATTGNACC
      +
      CCCGGGGGG    3:data1.txt.out    2:data2.txt.out    count:5

      @NS500278
      TACAGNGAG
      +
      CCCGGGGGG    2:data1.txt.out    2:data2.txt.out    count:4
      My code is:
      #!/usr/bin/env perl
      use strict;
      use warnings;

      my %seen;
      $/ = "";

      while (<>) {
          chomp;
          my ($key, $value) = split ('\t', $_);
          my @lines = split /\n/, $key;
          my $key1  = $lines[1];
          $seen{$key1} //= [ $key ];
          push (@{$seen{$key1}}, $value);
      }

      foreach my $key1 ( sort keys %seen ) {
          my $tot        = 0;
          my $file_count = @ARGV;
          for my $val ( @{$seen{$key1}} ) {
              $tot += ( split /:/, $val )[0];
          }
          if ( @{ $seen{$key1} } >= $file_count) {
              print join( "\t", @{$seen{$key1}});
              print "\tcount:" . $tot . "\n\n";
          }
      }
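
      Would tying %seen to a disk-backed store be the right direction? Something like this, perhaps (an untested sketch; I mention DBM::Deep only because the stored values are array references, which plain DBM files cannot hold directly):

      #!/usr/bin/env perl
      use strict;
      use warnings;
      use DBM::Deep;

      # Replaces "my %seen;" in the script above; everything else stays
      # the same.  %seen now lives in a file, so memory use stays small
      # (at the cost of speed).
      tie my %seen, 'DBM::Deep', 'seen.db';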

        I believe I have a solution for your problem that should work with files of any size and will finish in less time than it would take to load your data into an RDBMS.

        I need confirmation about your keys (DNA sequences): 1) are they all 9 characters long? 2) do they consist only of combinations of the 5 characters ACGTN?


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
        In the absence of evidence, opinion is indistinguishable from prejudice.