azheid has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I have an efficiency question. I need to create a very large tab-delimited file, 4^7 by 4^7 in dimension. What I have now is a double loop which prints out each element individually in its appropriate location. The file is a symmetric square matrix. I am unsure whether it will take more time to save 0.5 * 4^7 * 4^7 entries to RAM so that I only calculate each value once, or whether it would be better to calculate each element twice inside the double loop. If any of the monks are especially bored, I would also appreciate comments on how I could improve the efficiency of the code below. Please forgive me if there are glaring errors; I am not a programmer, just a lowly grad student pretending to be one.

#!/usr/bin/perl
use strict;
use warnings;

my @kmers = <>;
chomp @kmers;
my $kmer_l = length( $kmers[0] );

for ( my $j = 0; $j < @kmers; ++$j ) {
    print &map_constraint( $kmer_l, &return_constraint( $kmers[$j], $kmers[0] ) );
    for ( my $i = 1; $i < @kmers; ++$i ) {
        print "\t", &map_constraint( $kmer_l, &return_constraint( $kmers[$j], $kmers[$i] ) );
    }
    print "\n";
}

sub map_constraint {
    ## input: string length, and edit distance value
    my $dt = 0.0000000001;
    return ( -( ( $_[1] + $dt ) / ( ( $_[0] / 2 ) + $dt ) - 1 ) );    ## returns a value in (-1,1)
}

sub return_constraint {    # string edit distance
    my ( $len1, $len2 ) = ( length $_[0], length $_[1] );
    if ( $len1 == 0 ) { return $len2; }
    if ( $len2 == 0 ) { return $len1; }
    my %hash;
    for ( my $i = 0; $i <= $len1; ++$i ) {
        for ( my $j = 0; $j <= $len2; ++$j ) {
            $hash{$i}{$j} = 0;
            $hash{0}{$j} = $j;
        }
        $hash{$i}{0} = $i;
    }
    my @a = split( //, $_[0] );
    my @b = split( //, $_[1] );
    for ( my $i = 1; $i <= $len1; ++$i ) {
        for ( my $j = 1; $j <= $len2; ++$j ) {
            my $cost = ( $a[ $i - 1 ] eq $b[ $j - 1 ] ) ? 0 : 1;
            $hash{$i}{$j} = &min( [ $hash{ $i - 1 }{$j} + 1, $hash{$i}{ $j - 1 } + 1, $hash{ $i - 1 }{ $j - 1 } + $cost ] );
        }
    }
    return $hash{$len1}{$len2};
}

sub min {
    my $min = ${ $_[0] }[0];
    foreach my $elem ( @{ $_[0] } ) {
        if ( $elem < $min ) {
            $min = $elem;
        }
    }
    return $min;
}
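
To make the caching option concrete, here is a rough, untested sketch of computing only the upper triangle and mirroring it for the lower one, reusing the map_constraint and return_constraint subs above. Note that caching roughly 0.5 * 4^7 * 4^7 values as Perl scalars will itself cost several gigabytes, so it is worth measuring on a small subset first.

my @dist;    # $dist[$j][ $i - $j ] caches the edit distance for $i >= $j
for my $j ( 0 .. $#kmers ) {
    my @row;
    for my $i ( 0 .. $#kmers ) {
        my $d;
        if ( $i >= $j ) {
            $d = return_constraint( $kmers[$j], $kmers[$i] );    # compute once...
            $dist[$j][ $i - $j ] = $d;
        }
        else {
            $d = $dist[$i][ $j - $i ];                           # ...reuse for the mirror cell
        }
        push @row, map_constraint( $kmer_l, $d );
    }
    print join( "\t", @row ), "\n";
}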

Re: Efficiency of implementation
by JavaFan (Canon) on Jan 31, 2012 at 19:32 UTC
    I am unsure whether it will take more time to save 0.5 * 4^7 * 4^7 entries to RAM so that I only calculate each value once, or whether it would be better to calculate each element twice inside the double loop.
    Judging from your code, it may be the latter. But that's more a guess than anything else. Which one is faster for you is something only you can determine. Try both methods and see which one is faster. It's impossible for us to know how fast you can claim memory on your OS/hardware, or how your box is going to behave if you use a lot of memory.
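
    One low-effort way to run that comparison on a small subset before committing to the full job is the core Benchmark module. A rough sketch, where build_recompute and build_cached are hypothetical wrappers around the two variants of the double loop:

        use Benchmark qw(cmpthese);

        # @sample = a few hundred kmers, not the full input
        cmpthese( 5, {
            recompute => sub { build_recompute( \@sample ) },   # calculate every cell
            cached    => sub { build_cached( \@sample ) },      # calculate the upper triangle, mirror the rest
        } );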

      Thanks for the reply. I will only have to do this a handful of times, so I am really looking for an expert's best guess.

        So, does your code run now? If it runs within whatever time constraint you have (and it sounds like even overnight is "fast enough"), why bother with more optimization? I mean, if it takes you a day or two of work and the program would have already finished by then, it's not a good use of your time. However, a 2D array would probably be better than a 2D hash.
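
        For what it is worth, a rough, untested sketch of the same edit-distance routine using nested arrays instead of a nested hash, with the min() call inlined to avoid an extra sub call per cell:

            sub return_constraint_aoa {    # string edit distance, array-of-arrays version
                my ( $s, $t ) = @_;
                my ( $len1, $len2 ) = ( length $s, length $t );
                return $len2 if $len1 == 0;
                return $len1 if $len2 == 0;
                my @d;
                $d[$_][0] = $_ for 0 .. $len1;    # first column
                $d[0][$_] = $_ for 0 .. $len2;    # first row
                my @a = split //, $s;
                my @b = split //, $t;
                for my $i ( 1 .. $len1 ) {
                    for my $j ( 1 .. $len2 ) {
                        my $cost = ( $a[ $i - 1 ] eq $b[ $j - 1 ] ) ? 0 : 1;
                        my $min = $d[ $i - 1 ][$j] + 1;              # deletion
                        my $ins = $d[$i][ $j - 1 ] + 1;              # insertion
                        my $sub = $d[ $i - 1 ][ $j - 1 ] + $cost;    # substitution
                        $min = $ins if $ins < $min;
                        $min = $sub if $sub < $min;
                        $d[$i][$j] = $min;
                    }
                }
                return $d[$len1][$len2];
            }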

        Update: Is there a way to provide a small, runnable data set? Even a 20x20 matrix or whatever? I.e., some data set that is small enough to post here?

Re: Efficiency of implementation
by BrowserUk (Patriarch) on Jan 31, 2012 at 19:48 UTC
    save 0.5 * 4^7 * 4^7 entries to RAM

    If every entry were just 1 byte, that would amount to: 745,058 GB or 727 TeraBytes.

    To my knowledge, there is no (micro)processor yet made that can address that much virtual memory, let alone physical ram.



      4^7 * 4^7 * 4 bytes per float ≈ 1.07 GB

      Did I do that computation incorrectly? I have 8 GB of RAM. I am pretty sure 1 scalar is 4 bytes, so that should be within the acceptable range?
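
      That estimate is worth double-checking, though: a Perl scalar carries a fair bit of interpreter bookkeeping on top of the 4 bytes of a raw C float (typically a few dozen bytes on a 64-bit perl, plus per-array overhead). Devel::Size from CPAN can measure a representative chunk before committing to the full run; a rough sketch, assuming the module is installed:

          use Devel::Size qw(total_size);

          # build a 1000 x 1000 block of floats and see what it really costs
          my @sample;
          $sample[$_] = [ (0.5) x 1000 ] for 0 .. 999;
          printf "%.1f bytes per stored value\n", total_size( \@sample ) / 1_000_000;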

        Sorry. I took your 4^7 to be shorthand for 4*10^7. My mistake.



      Color me embarrassed for not checking the actual size of that computation. Sorry to waste your time.

Re: Efficiency of implementation
by Anonymous Monk on Jan 31, 2012 at 22:10 UTC
    Is there anything you can do to avoid actually having to store (4^7)^2 entries? Is the matrix sparse; in other words, are most of the entries zero (or some other known value)? Any matrix can be expressed as a list of (x, y, value) triples, and therefore also as a database table. Even though, mathematically speaking, the values form a matrix, it might not be physically possible to store it in the computer's memory as one.
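
    If the matrix did turn out to be mostly one value, a rough sketch of that (x, y, value) idea in plain Perl, storing only the cells that differ from the default (hypothetical helper names, untested):

        my %cell;          # sparse store: "$x,$y" => value
        my $default = 0;   # whatever the common value is

        sub set_cell {
            my ( $x, $y, $v ) = @_;
            $cell{"$x,$y"} = $v unless $v == $default;
        }

        sub get_cell {
            my ( $x, $y ) = @_;
            return exists $cell{"$x,$y"} ? $cell{"$x,$y"} : $default;
        }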
Re: Efficiency of implementation
by Jenda (Abbot) on Feb 01, 2012 at 13:42 UTC

    I did not spend enough time to be sure, but it doesn't look like your matrix is sparse, and the indices are integers, so ... why a hash? An array of arrays would work better and take up less space.

    And actually ... why an AoA at all? I think you should be using PDL.


      I think you should be using PDL

      So do I, but despite several attempts to use it, I still wouldn't know where to start.

      And no one ever seems to post any code with their "use PDL" suggestions :(
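
      For completeness, a rough, untested sketch of what the PDL route might look like here, reusing @kmers, $kmer_l and the two subs from the original post (a 16384 x 16384 single-precision piddle is about 1 GiB of packed floats, so it should fit in 8 GB):

          use PDL;

          my $n = 4**7;                        # 16384
          my $m = zeroes( float, $n, $n );     # packed 4-byte floats, ~1 GiB

          for my $j ( 0 .. $n - 1 ) {
              for my $i ( $j .. $n - 1 ) {
                  my $v = map_constraint( $kmer_l, return_constraint( $kmers[$j], $kmers[$i] ) );
                  $m->set( $i, $j, $v );       # fill both halves of the symmetric matrix
                  $m->set( $j, $i, $v );
              }
          }

          # write one tab-delimited row at a time
          for my $j ( 0 .. $n - 1 ) {
              print join( "\t", $m->slice(":,($j)")->list ), "\n";
          }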



        And I thought it was just me; I was mildly interested, but never actually had a need to use PDL :-(
