Re^6: Some code optimization

Re^7: Some code optimization by graff (Chancellor) on Jun 18, 2010 at 10:20 UTC
To conclude, I agree that "is_contained()" consumes more time, but I suggest that we first assert that the first, much simpler function is "optimal"

Suit yourself, but to me your statement is a clear example of misplaced priorities, bordering on irrational obsession. The few changes I suggested save an amount of time equal to the total consumed by that first function.

Since I'm new to perl optimization...

... or to optimization in general? If you want to focus on optimizing that first function, you should be coding it in C (and that'll be a waste of time if you don't fix the second function, and/or move stuff from inner to outer loops, or just work out a better approach from first principles).

UPDATE: BTW, did you happen to try moving the logic of that first function into the caller? (i.e. do that stuff in-line in "scenario()" rather than as a sub call to "gene_to_legal_range") -- that might trim your 16 sec. test case down to 13 (about 20%, which is respectable).
Re^8: Some code optimization by roibrodo (Sexton) on Jun 18, 2010 at 10:32 UTC
My goal is to optimize both. Since I'm new to this aspect of Perl, I simply suggested it would be more beneficial to start with the simpler optimization, then move on to the more complex one.

I guess our approaches are different. I am a student, and currently my ultimate goal here is to learn. I think this is better achieved by going from simpler things to more complex ones. According to your approach, if I were given a black box that solves my problem in a split second, I should use it and be done with it. That would obviously solve the problem, but it would teach me nothing. Anyway, I'm sorry to hear that you think I'm "bordering on irrational obsession". This was actually quite offensive.

UPDATE: re. your update, I did. Quite surprisingly, it now takes significantly longer to run. I guess I have some error, but I can't see it.

Inline version:

use strict;
use warnings;
use List::Util qw(max min);
use Time::HiRes qw(time);

# this builds a structure that is usually retrieved from disk.
# in this example we will use this structure again and again,
# but in the real program we obviously retrieve a fresh structure
# at each iteration

my $simulation_h = {};
for ( 1 .. 70000 ) {
    my $random_start  = int( rand(5235641) );
    my $random_length = int( rand(40000) );
    push @{ $simulation_h->{$random_start} }, $random_length;
}

my $zone_o = {
    _chromosome_length => 5235641,
    _legal_range       => [ { FROM => 100000, TO => 200000 } ]
};

my $start_time = time;
scenario();
print "total loop time: " . ( time - $start_time ) . " seconds\n";
my $temp_gene_to_legal_range;
my $gene_to;

sub scenario {
    for ( my $i = 0 ; $i < 50 ; $i++ ) {
        print "i=$i time=" . ( time - $start_time ) . " seconds\n";

        # originally there was a retrieve of $simulation_h from disk here

        # iterate genes

        foreach my $gene_from ( keys %{$simulation_h} ) {
            foreach my $gene_length ( @{ $simulation_h->{$gene_from} } ) {

### inlining gene_to_legal_range
                $gene_to =
                  ( ( $gene_from + $gene_length - 1 )
                    % ( $zone_o->{_chromosome_length} ) ) + 1;

                if ( $gene_to < $gene_from ) {

                    # split
                    # low range first
                    $temp_gene_to_legal_range = [
                        { FROM => 0, TO => $gene_to },
                        {
                            FROM => $gene_from,
                            TO   => $zone_o->{_chromosome_length}
                        }
                    ];
                }
                else {

                    # single
                    $temp_gene_to_legal_range =
                      [ { FROM => $gene_from, TO => $gene_to } ];
                }

            }
        }
    }
}

21 seconds

Previous version (with subroutine call):

use strict;
use warnings;
use List::Util qw(max min);
use Time::HiRes qw(time);

# this builds a structure that is usually retrieved from disk.
# in this example we will use this structure again and again,
# but in the real program we obviously retrieve a fresh structure
# at each iteration

my $simulation_h = {};
for ( 1 .. 70000 ) {
    my $random_start  = int( rand(5235641) );
    my $random_length = int( rand(40000) );
    push @{ $simulation_h->{$random_start} }, $random_length;
}

my $zone_o = {
    _chromosome_length => 5235641,
    _legal_range       => [ { FROM => 100000, TO => 200000 } ]
};

my $start_time = time;
scenario();
print "total loop time: " . ( time - $start_time ) . " seconds\n";
my $temp_gene_to_legal_range;
my $gene_to;

sub scenario {
    for ( my $i = 0 ; $i < 50 ; $i++ ) {
        print "i=$i time=" . ( time - $start_time ) . " seconds\n";

        # originally there was a retrieve of $simulation_h from disk here

        # iterate genes

        foreach my $gene_from ( keys %{$simulation_h} ) {
            foreach my $gene_length ( @{ $simulation_h->{$gene_from} } ) {

#### inlining gene_to_legal_range
#                $gene_to =
#                  ( ( $gene_from + $gene_length - 1 )
#                    % ( $zone_o->{_chromosome_length} ) ) + 1;
#
#                if ( $gene_to < $gene_from ) {
#
#                    # split
#                    # low range first
#                    $temp_gene_to_legal_range = [
#                        { FROM => 0, TO => $gene_to },
#                        {
#                            FROM => $gene_from,
#                            TO   => $zone_o->{_chromosome_length}
#                        }
#                    ];
#                }
#                else {
#
#                    # single
#                    $temp_gene_to_legal_range =
#                      [ { FROM => $gene_from, TO => $gene_to } ];
#                }
#
#                next; ####

                $temp_gene_to_legal_range =
                  gene_to_legal_range( $gene_from, $gene_length,
                    $zone_o->{_chromosome_length} );

            }
        }
    }
}

sub gene_to_legal_range($$$) {
    return;
    my ( $gene_from, $gene_length, $legal_length ) = @_;

    my $ret;
    my $gene_to = ( ( $gene_from + $gene_length - 1 ) % ($legal_length) ) + 1;

    if ( $gene_to < $gene_from ) {

        # split
        # low range first
        $ret = [
            { FROM => 0,          TO => $gene_to },
            { FROM => $gene_from, TO => $legal_length }
        ];
    }
    else {

        # single
        $ret = [ { FROM => $gene_from, TO => $gene_to } ];
    }

    return $ret;
}

7.5 seconds
Re^9: Some code optimization by graff (Chancellor) on Jun 18, 2010 at 11:54 UTC
You are comparing "running the logic of the function in-line" vs. "calling a function that does not execute the logic". That's not the comparison that matters. Comment out the "return" at the top of the function in the latter version, and the timing shows the function call to be more expensive than the inline code.

Another minor detail: in your inline version, you do two hash lookups for the value of `$zone_o->{_chromosome_length}` whereas the function call version does only one lookup. If you change the inline version to assign that value to a "my" variable, and use that variable twice (just like it was used twice in the function call), you'll see a reduction of a few sec -- i.e. the improvement over the function call will be more evident.

Now, get back to the part that really matters.
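A minimal sketch of the fair comparison described above, using the core Benchmark module. The sub names (run_sub_call, run_inline_two_lookups, run_inline_cached), the @genes array, and the reduced data size are assumptions made for illustration and do not come from the original posts. The function body is the thread's gene_to_legal_range with the early "return" removed. Timings depend on the machine and Perl version, so no expected numbers are shown.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical stand-in for the thread's data, reduced so the run is quick.
my $zone_o = { _chromosome_length => 5235641 };
my @genes  = map { [ int( rand(5235641) ), int( rand(40000) ) ] } 1 .. 1000;

# Same logic as the thread's function, without the early "return".
sub gene_to_legal_range {
    my ( $gene_from, $gene_length, $legal_length ) = @_;
    my $gene_to = ( ( $gene_from + $gene_length - 1 ) % $legal_length ) + 1;
    return $gene_to < $gene_from
      ? [ { FROM => 0, TO => $gene_to },
          { FROM => $gene_from, TO => $legal_length } ]    # split range
      : [ { FROM => $gene_from, TO => $gene_to } ];        # single range
}

# Variant 1: one sub call per gene.
sub run_sub_call {
    my $range;
    for my $gene (@genes) {
        $range = gene_to_legal_range( @$gene, $zone_o->{_chromosome_length} );
    }
    return $range;
}

# Variant 2: inline logic, hash lookup repeated inside the loop.
sub run_inline_two_lookups {
    my $range;
    for my $gene (@genes) {
        my ( $gene_from, $gene_length ) = @$gene;
        my $gene_to = ( ( $gene_from + $gene_length - 1 )
              % $zone_o->{_chromosome_length} ) + 1;
        $range = $gene_to < $gene_from
          ? [ { FROM => 0, TO => $gene_to },
              { FROM => $gene_from, TO => $zone_o->{_chromosome_length} } ]
          : [ { FROM => $gene_from, TO => $gene_to } ];
    }
    return $range;
}

# Variant 3: inline logic, hash value copied once into a "my" variable.
sub run_inline_cached {
    my $legal_length = $zone_o->{_chromosome_length};
    my $range;
    for my $gene (@genes) {
        my ( $gene_from, $gene_length ) = @$gene;
        my $gene_to = ( ( $gene_from + $gene_length - 1 ) % $legal_length ) + 1;
        $range = $gene_to < $gene_from
          ? [ { FROM => 0, TO => $gene_to },
              { FROM => $gene_from, TO => $legal_length } ]
          : [ { FROM => $gene_from, TO => $gene_to } ];
    }
    return $range;
}

# Run each variant for at least 1 CPU second and print a comparison table.
cmpthese(
    -1,
    {
        sub_call           => \&run_sub_call,
        inline_two_lookups => \&run_inline_two_lookups,
        inline_cached      => \&run_inline_cached,
    }
);
```

All three variants do the same work per gene, so any difference in the table should come from the call overhead and the extra hash lookups rather than from skipped logic.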
Re^10: Some code optimization by roibrodo (Sexton) on Jun 18, 2010 at 12:12 UTC
Re^9: Some code optimization by graff (Chancellor) on Jun 18, 2010 at 10:48 UTC
Sorry, that wasn't meant as a personal insult -- it was an exaggerated comment about the thought pattern (which I recognize, because I fall prey to it myself, even after 25 years of professional programming in numerous languages).

It was just a bit jarring for me to see the contradiction laid out so plainly: you are worried (appropriately) about how long it'll take for your script to run on real data and you want to learn how to fix it, but you'll put off the work of addressing the most costly part of the algorithm, because you want to do the simpler part first?

I guess that's okay if you want to remain a student indefinitely. But in terms of holding a job, that approach is not just "different"... Anyway, good luck.



