azheid has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I have an efficiency question. I need to create a very large tab-delimited file, 4^7 by 4^7 in dimension. What I have now is a double loop which prints out each element individually in its appropriate location. The file is a symmetric square matrix. I am unsure whether it will take more time to save 0.5 * 4^7 * 4^7 entries to RAM so that I only calculate each value once, or whether it would be better to calculate each element twice inside the double loop. If any of the monks are especially bored, I would also appreciate comments on how I could improve the efficiency of the code below. Please forgive me if there are glaring errors; I am not a programmer, just a lowly grad student pretending to be one.

#!/usr/bin/perl
use strict;
use warnings;

my @kmers = <>;
chomp @kmers;
my $kmer_l = length( $kmers[0] );

for ( my $j = 0; $j < @kmers; ++$j ) {
    print &map_constraint( $kmer_l, &return_constraint( $kmers[$j], $kmers[0] ) );
    for ( my $i = 1; $i < @kmers; ++$i ) {
        print "\t", &map_constraint( $kmer_l, &return_constraint( $kmers[$j], $kmers[$i] ) );
    }
    print "\n";
}

sub map_constraint {
    ## input: string length, and edit distance value
    my $dt = 0.0000000001;
    return ( -( ( $_[1] + $dt ) / ( ( $_[0] / 2 ) + $dt ) - 1 ) );    ## returns a value in (-1,1)
}

sub return_constraint {    # string edit distance
    my ( $len1, $len2 ) = ( length $_[0], length $_[1] );
    if ( $len1 == 0 ) { return $len2; }
    if ( $len2 == 0 ) { return $len1; }
    my %hash;
    for ( my $i = 0; $i <= $len1; ++$i ) {
        for ( my $j = 0; $j <= $len2; ++$j ) {
            $hash{$i}{$j} = 0;
            $hash{0}{$j} = $j;
        }
        $hash{$i}{0} = $i;
    }
    my @a = split( //, $_[0] );
    my @b = split( //, $_[1] );
    for ( my $i = 1; $i <= $len1; ++$i ) {
        for ( my $j = 1; $j <= $len2; ++$j ) {
            my $cost = ( $a[ $i - 1 ] eq $b[ $j - 1 ] ) ? 0 : 1;
            $hash{$i}{$j} = &min( [ $hash{ $i - 1 }{$j} + 1, $hash{$i}{ $j - 1 } + 1, $hash{ $i - 1 }{ $j - 1 } + $cost ] );
        }
    }
    return $hash{$len1}{$len2};
}

sub min {
    my $min = ${ $_[0] }[0];
    foreach my $elem ( @{ $_[0] } ) {
        if ( $elem < $min ) {
            $min = $elem;
        }
    }
    return $min;
}
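
To make the caching option concrete, here is a rough, untested sketch of computing only the upper triangle and mirroring it for the lower one, reusing the map_constraint and return_constraint subs above. Note that caching roughly 0.5 * 4^7 * 4^7 values as Perl scalars will itself cost several gigabytes, so it is worth measuring on a small subset first.

my @dist;    # $dist[$j][ $i - $j ] caches the edit distance for $i >= $j
for my $j ( 0 .. $#kmers ) {
    my @row;
    for my $i ( 0 .. $#kmers ) {
        my $d;
        if ( $i >= $j ) {
            $d = return_constraint( $kmers[$j], $kmers[$i] );    # compute once...
            $dist[$j][ $i - $j ] = $d;
        }
        else {
            $d = $dist[$i][ $j - $i ];                           # ...reuse for the mirror cell
        }
        push @row, map_constraint( $kmer_l, $d );
    }
    print join( "\t", @row ), "\n";
}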

Re: Efficiency of implementation
by JavaFan (Canon) on Jan 31, 2012 at 19:32 UTC
    I am unsure whether it will take more time to save 0.5 * 4^7 * 4^7 entries to RAM so that I only calculate each value once, or whether it would be better to calculate each element twice inside the double loop.
    Judging from your code, it may be the latter. But that's more a guess than anything else. Which one is faster for you is something only you can determine. Try both methods and see which one is faster. It's impossible for us to know how fast you can claim memory on your OS/hardware, or how your box is going to behave if you use a lot of memory.
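
    One low-effort way to run that comparison on a small subset before committing to the full job is the core Benchmark module. A rough sketch, where build_recompute and build_cached are hypothetical wrappers around the two variants of the double loop:

        use Benchmark qw(cmpthese);

        # @sample = a few hundred kmers, not the full input
        cmpthese( 5, {
            recompute => sub { build_recompute( \@sample ) },   # calculate every cell
            cached    => sub { build_cached( \@sample ) },      # calculate the upper triangle, mirror the rest
        } );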

      Thanks for the reply. I will only have to do this a handful of times, so I am really looking for an expert's best guess.

        So, does your code run now? If it runs within whatever time constraint you have (and it sounds like even overnight is "fast enough"), why bother with more optimization? I mean, if it takes you a day or two of work and the program would have already finished by then, it's not a good use of your time. However, a 2D array would probably be better than a 2D hash.
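
        For what it is worth, a rough, untested sketch of the same edit-distance routine using nested arrays instead of a nested hash, with the min() call inlined to avoid an extra sub call per cell:

            sub return_constraint_aoa {    # string edit distance, array-of-arrays version
                my ( $s, $t ) = @_;
                my ( $len1, $len2 ) = ( length $s, length $t );
                return $len2 if $len1 == 0;
                return $len1 if $len2 == 0;
                my @d;
                $d[$_][0] = $_ for 0 .. $len1;    # first column
                $d[0][$_] = $_ for 0 .. $len2;    # first row
                my @a = split //, $s;
                my @b = split //, $t;
                for my $i ( 1 .. $len1 ) {
                    for my $j ( 1 .. $len2 ) {
                        my $cost = ( $a[ $i - 1 ] eq $b[ $j - 1 ] ) ? 0 : 1;
                        my $min = $d[ $i - 1 ][$j] + 1;              # deletion
                        my $ins = $d[$i][ $j - 1 ] + 1;              # insertion
                        my $sub = $d[ $i - 1 ][ $j - 1 ] + $cost;    # substitution
                        $min = $ins if $ins < $min;
                        $min = $sub if $sub < $min;
                        $d[$i][$j] = $min;
                    }
                }
                return $d[$len1][$len2];
            }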

        Update: Is there a way to provide a small, runnable data set? Even a 20x20 matrix or whatever? I.e., some data set that is small enough to post here?

Re: Efficiency of implementation
by BrowserUk (Patriarch) on Jan 31, 2012 at 19:48 UTC
    save 0.5 * 4^7 * 4^7 entries to RAM

    If every entry were just 1 byte, that would amount to: 745,058 GB or 727 TeraBytes.

    To my knowledge, there is no (micro)processor yet made that can address that much virtual memory, let alone physical ram.



      4^7 * 4^7 * 4 bytes per float ≈ 1.07 GB

      Did I do that computation incorrectly? I have 8 GB of RAM. I am pretty sure 1 scalar is 4 bytes, so that should be within the acceptable range?
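
      That estimate is worth double-checking, though: a Perl scalar carries a fair bit of interpreter bookkeeping on top of the 4 bytes of a raw C float (typically a few dozen bytes on a 64-bit perl, plus per-array overhead). Devel::Size from CPAN can measure a representative chunk before committing to the full run; a rough sketch, assuming the module is installed:

          use Devel::Size qw(total_size);

          # build a 1000 x 1000 block of floats and see what it really costs
          my @sample;
          $sample[$_] = [ (0.5) x 1000 ] for 0 .. 999;
          printf "%.1f bytes per stored value\n", total_size( \@sample ) / 1_000_000;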

        Sorry. I took your 4^7 to be shorthand for 4*10^7. My mistake.



      Color me embarrassed for not checking the actual size of that computation. Sorry to waste your time.

Re: Efficiency of implementation
by Anonymous Monk on Jan 31, 2012 at 22:10 UTC
    Is there anything you can do to avoid actually having to store (4^7)^2 entries? Is the matrix sparse; in other words, are most of the entries zero (or some other known value)? Any matrix can be expressed as a list of (x, y, value) triples, and therefore also as a database table. Even though, mathematically speaking, the values form a matrix, it might not be physically possible to store it in the computer's memory as one.
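
    If the matrix did turn out to be mostly one value, a rough sketch of that (x, y, value) idea in plain Perl, storing only the cells that differ from the default (hypothetical helper names, untested):

        my %cell;          # sparse store: "$x,$y" => value
        my $default = 0;   # whatever the common value is

        sub set_cell {
            my ( $x, $y, $v ) = @_;
            $cell{"$x,$y"} = $v unless $v == $default;
        }

        sub get_cell {
            my ( $x, $y ) = @_;
            return exists $cell{"$x,$y"} ? $cell{"$x,$y"} : $default;
        }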
Re: Efficiency of implementation
by Jenda (Abbot) on Feb 01, 2012 at 13:42 UTC

    I did not spend enough time to be sure, but it doesn't look like your matrix is sparse, and the indices are integers, so ... why a hash? An array of arrays would work better and take up less space.

    And actually ... why an AoA at all? I think you should be using PDL.


      I think you should be using PDL

      So do I, but despite several attempts to use it, I still wouldn't know where to start.

      And no one ever seems to post any code with their "use PDL" suggestions :(
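
      For completeness, a rough, untested sketch of what the PDL route might look like here, reusing @kmers, $kmer_l and the two subs from the original post (a 16384 x 16384 single-precision piddle is about 1 GiB of packed floats, so it should fit in 8 GB):

          use PDL;

          my $n = 4**7;                        # 16384
          my $m = zeroes( float, $n, $n );     # packed 4-byte floats, ~1 GiB

          for my $j ( 0 .. $n - 1 ) {
              for my $i ( $j .. $n - 1 ) {
                  my $v = map_constraint( $kmer_l, return_constraint( $kmers[$j], $kmers[$i] ) );
                  $m->set( $i, $j, $v );       # fill both halves of the symmetric matrix
                  $m->set( $j, $i, $v );
              }
          }

          # write one tab-delimited row at a time
          for my $j ( 0 .. $n - 1 ) {
              print join( "\t", $m->slice(":,($j)")->list ), "\n";
          }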



        And I thought it was just me; I was mildly interested, but never actually had a need to use PDL :-(
