In a Perl.com article on bioinformatics and Perl, there was an interesting snippet used to demonstrate different optimization techniques. The technique described was to inline a subroutine's body if the subroutine is small. I'm not disputing that.
while ((my $begin, my $end) = each %exon_endpoints) {
    print get_exon($chromosome, $begin, $end), "\n\n";
}

sub get_exon {
    my ($chromosome, $begin, $end) = @_;
    # The arguments to substr are: string, beginning, length
    return substr($chromosome, $begin - 1, $end - $begin + 1);
}

becomes ...

while ((my $begin, my $end) = each %exon_endpoints) {
    print substr($chromosome, $begin - 1, $end - $begin + 1), "\n\n";
}

Now, it's relatively obvious that the first version is the more readable, and the author says as much. The second, given a large enough $chromosome, is much faster.
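To put a number on that claim, a Benchmark comparison along these lines should do (a throwaway sketch, untested as always; the string size and endpoints are made up):

use Benchmark qw(cmpthese);

my $chromosome = 'ACGT' x 2_500_000;    # ~10 MB test string (made-up size)
my %exon_endpoints = map { $_ * 1000 + 1 => $_ * 1000 + 500 } 1 .. 100;

sub get_exon {
    my ($chromosome, $begin, $end) = @_;
    return substr($chromosome, $begin - 1, $end - $begin + 1);
}

cmpthese(-3, {
    sub_call => sub {
        while (my ($begin, $end) = each %exon_endpoints) {
            my $exon = get_exon($chromosome, $begin, $end);
        }
    },
    inlined => sub {
        while (my ($begin, $end) = each %exon_endpoints) {
            my $exon = substr($chromosome, $begin - 1, $end - $begin + 1);
        }
    },
});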

I was wondering if there were optimizations other than passing by reference (as suggested in an earlier optimization option). The only one I could think of was to use closures, along the lines of:

my $chromosome;
get_chromosome(1, \$chromosome);

my $get_exon = sub {
    substr $chromosome, $_[0] - 1, $_[1] - $_[0] + 1;
};

while (my @ends = each %exon_endpoints) {
    print $get_exon->(@ends), "\n\n";
}
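For comparison, the pass-by-reference option mentioned above would be something like this (the name get_exon_ref is just for illustration):

sub get_exon_ref {
    my ($chromosome_ref, $begin, $end) = @_;
    # only a reference is copied into @_, never the big string itself
    return substr($$chromosome_ref, $begin - 1, $end - $begin + 1);
}

while ((my $begin, my $end) = each %exon_endpoints) {
    print get_exon_ref(\$chromosome, $begin, $end), "\n\n";
}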

------
We are the carpenters and bricklayers of the Information Age.

The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

... strings and arrays will suffice. As they are easily available as native data types in any sane language, ... - blokhead, speaking on evolutionary algorithms

Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: Large data processing ...
by liz (Monsignor) on Oct 16, 2003 at 17:44 UTC
    sub get_exon {
        return substr($_[0], $_[1] - 1, $_[2] - $_[1] + 1);
    }

    Note that by using $_[0] directly, you're accessing $chromosome without copying it. This is because the @_ array is aliased to the parameters passed. So this saves you a reference and a dereference on top of the savings that you would have when passing a reference, while you lose out on readability.

    Liz

      This is because the @_ array is aliased to the parameters passed.

      It's worth explaining aliasing a little better, I think: I like to think of it as a reference that doesn't need dereferencing, which is essentially how it's implemented as well. :-)
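      The effect is easy to demonstrate, since assigning through @_ changes the caller's variable (a throwaway sketch):

      sub bump { $_[0]++ }        # $_[0] is an alias, not a copy

      my $n = 41;
      bump($n);
      print $n, "\n";             # prints 42: the caller's $n was modified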

      while you lose out on readability.

      Well, you could use for:

      sub get_exon {
          my $end   = pop;
          my $begin = pop;
          # alias $_[0] to $chromosome to avoid the copy-semantics speed penalty
          for my $chromosome (@_) {
              return substr($chromosome, $begin - 1, $end - $begin + 1);
          }
      }

      # or perhaps

      sub get_exon {
          my ($begin, $end) = @_[1, 2];
          # alias $_[0] to $chromosome to avoid the copy-semantics speed penalty
          for my $chromosome (@_) {
              return substr($chromosome, $begin - 1, $end - $begin + 1);
          }
      }

      Of course it's not as efficient as the straightforward access to @_, but it is worth remembering that for (LIST) aliases the iterator variable to the values being iterated over. As such, this might be a nice middle ground between super-optimised and unreadable on one hand, and optimised and relatively readable on the other.

      :-)
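      The for (LIST) aliasing is easy to see in isolation, too (another throwaway sketch):

      my @nums = (1, 2, 3);
      $_ *= 2 for @nums;          # $_ aliases each element in turn
      print "@nums\n";            # prints "2 4 6": the array itself changed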


      ---
      demerphq

        First they ignore you, then they laugh at you, then they fight you, then you win.
        -- Gandhi


Re: Large data processing ...
by perrin (Chancellor) on Oct 16, 2003 at 18:01 UTC
    Your closure approach looks about the same as using a global, and I think using a global is a more obvious approach in this case. I doubt there's a significant difference between using a global and using $_[0] (an implicit pass-by-reference), so I would use whichever one makes the code more readable.
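    The global version would be something along these lines (a sketch, assuming the chromosome is loaded once at startup):

    our $chromosome;              # package global, set once after loading

    sub get_exon {
        my ($begin, $end) = @_;   # only the small endpoint values are copied
        return substr($chromosome, $begin - 1, $end - $begin + 1);
    }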
Re: Large data processing ...
by BrowserUk (Patriarch) on Oct 16, 2003 at 22:02 UTC

    I think the main problem starts before the code in the snippets is reached.

    The way in which the start and end positions are stored is very strange. Why put them into a hash that way? The only way I could imagine them being generated, assuming they are found by a regex search, is something like this

    my %exon_endpoints;
    $exon_endpoints{ $-[0] } = $+[0] while $chromosome =~ m[(CG.*?AT)]g;

    This could easily (and efficiently) be changed to

    my %exon_endpoints;
    $exon_endpoints{ $-[0] } = $+[0] - $-[0] while $chromosome =~ m[(CG.*?AT)]g;

    ...

    while( my( $offset, $length ) = each %exon_endpoints ) {
        print substr( $chromosome, $offset, $length ), "\n\n";
    }

    thus removing the need to call a user subroutine at all without loss of clarity.

    That said, I still think that storing the position information this way is a bad idea, as it means the exons are printed out in a random order; the only correction is to sort the keys prior to printing.
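    That correction would look something like this (a sketch, using the offset/length hash from above):

    for my $offset ( sort { $a <=> $b } keys %exon_endpoints ) {
        print substr( $chromosome, $offset, $exon_endpoints{ $offset } ), "\n\n";
    }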

    I think a better way would be to build an array of LVALUE refs to the exons.

    my @exons;
    push @exons, eval "\\substr( \$chromosome, $-[0], $+[0] - $-[0] )"
        while $chromosome =~ m[(CG.*?AT)]g;

    ...

    for ( @exons ) {
        print $$_, "\n\n";
    }

    This builds an array of LVALUE refs directly into $chromosome. The array uses much less memory than the equivalent hash, each element being just an SV plus a little magic.

    The for loop processes the exons efficiently, in their original order, without the need to sort, without creating any long lists (thanks to for's iterator magic), without any copying of strings, and without the need to call a user sub to achieve clarity.

    The need to use eval will slow things down when building the list, but that will probably be offset by avoiding the hashing. The eval is only needed because of a bug that was fixed in 5.8.1, so on 5.8.1 and later that expense can be avoided as well.
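    On 5.8.1 and later the construction should reduce to something like this (a sketch, untested):

    my @exons;
    push @exons, \substr( $chromosome, $-[0], $+[0] - $-[0] )
        while $chromosome =~ m[(CG.*?AT)]g;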

    I realise that I have changed the ground rules somewhat and made some assumptions about how the hash was being generated, but often the best way to optimise a slow process is to stand back and look not at where the slowness manifests itself, but at where it originates.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    Hooray!

Re: Large data processing ...
by Abigail-II (Bishop) on Oct 16, 2003 at 20:53 UTC
    Well, you use the function get_exon to make it more readable, but all that function does is manipulate the $begin and $end values. Furthermore, the slowdown comes from copying a potentially large string: $chromosome. A solution is to make a function that takes just the two arguments $begin and $end and manipulates those; the manipulation doesn't require $chromosome at all. So:

    sub _ { $_[0] - 1, $_[1] - $_[0] + 1 }

    while ((my $begin, my $end) = each %exon_endpoints) {
        print substr($chromosome => _ $begin, $end), "\n\n";
    }

    Of course, that still has the overhead of calling a user-defined subroutine for each endpoint.

    Abigail