Re: Large data processing ...

I think the main problem starts before the code in the snippets is reached.

The way in which the start and end positions are stored is very strange. Why put them into a hash that way? The only way I can think of that this would be done, assuming they are being found by a regex search, is something like this:

my %endpoints;
$endpoints{ $-[0] } = $+[0] 
    while $chromosome =~ m[(CG.*?AT)]g;

This could easily (and efficiently) be changed to

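my %endpoints;
# key: 0-based start offset of the match; value: length of the match
$endpoints{ $-[0] } = $+[0] - $-[0]
    while $chromosome =~ m[(CG.*?AT)]g;

...
# substr takes ( string, offset, length ), so no helper sub is needed
while( my( $begin, $length ) = each %endpoints ) {
    print substr( $chromosome, $begin, $length ), "\n\n";
}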

thus removing the need to call a user subroutine at all without loss of clarity.

That said, I still think that storing the position information this way is a bad idea. It means that the exons are printed out in a random order, and the only correction is to build two lists of the keys in order to sort them prior to printing.
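To show what that correction costs, here is a minimal sketch of the sort-before-print step. It assumes the hash holds start offset => length as found by the regex; the sample string and the @ordered array are made up for illustration, not taken from the original code.

```perl
use strict;
use warnings;

# Hypothetical sample data; the real $chromosome would be far larger.
my $chromosome = 'ttCGaaATggCGcccATtt';

# start offset => length, as in the hash-based version
my %endpoints;
$endpoints{ $-[0] } = $+[0] - $-[0]
    while $chromosome =~ m[(CG.*?AT)]g;

# Hash order is arbitrary, so the keys must be listed and then
# sorted numerically before the exons can be extracted in order.
my @ordered = map  { substr( $chromosome, $_, $endpoints{$_} ) }
              sort { $a <=> $b } keys %endpoints;

print "$_\n\n" for @ordered;
```

Both the key list and its sorted copy are held in memory at once, which is the overhead the array approach avoids.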

I think a better way would be to build an array of LVALUE refs to the exons.

my @exons;
push @exons, eval "\\substr( \$chromosome, 
                   $-[0],
                   $+[0] - $-[0] )"
    while $chromosome =~ m[(CG.*?AT)]g;

...
for( @exons ) {
    print $$_, "\n\n";
}

This builds an array of LVALUE refs directly into $chromosome. The array uses much less memory than the equivalent hash, each element being just an SV plus a little magic.
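For anyone who hasn't met LVALUE refs, a small sketch of what "directly into $chromosome" means. The sample string, the offsets and the $exon and $before names are all invented for the illustration.

```perl
use strict;
use warnings;

my $chromosome = 'ttCGaaATgg';            # hypothetical sample data
my $exon = \substr( $chromosome, 2, 6 );  # LVALUE ref covering offsets 2..7

my $before = $$exon;   # reads 'CGaaAT' straight out of $chromosome
$$exon = lc $$exon;    # assigning through the ref edits $chromosome itself

print ref($exon), ": $before -> $chromosome\n";
```

Dereferencing never copies the exon into a separate variable unless you ask it to, and a write through the ref lands in the original string.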

The for loop efficiently processes the exons in their original order, without the need to sort, without creating any long lists (thanks to for's iterator magic), without any copying of strings, and without the need to call a user sub to achieve clarity.

The need to use eval will slow things down when building the list, but that will probably be offset by avoiding hashing, and the eval is only needed because of a bug. This was fixed in 5.8.1, so on that version or later the expense of the eval is avoided as well.
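On a perl where that fix is present, the eval can presumably just be dropped. A sketch of the same idea without it, assuming 5.8.1 or later and using a made-up sample string:

```perl
use strict;
use warnings;

my $chromosome = 'ttCGaaATggCGcccATtt';   # hypothetical sample data

# Each pass takes a fresh LVALUE ref covering the current match:
# $-[0] is the match's start offset, $+[0] - $-[0] is its length.
my @exons;
push @exons, \substr( $chromosome, $-[0], $+[0] - $-[0] )
    while $chromosome =~ m[(CG.*?AT)]g;

# The exons come out in the order they were found.
print $$_, "\n\n" for @exons;
```

If every element turns out to refer to the same substring, the perl in use still has the bug and the eval form is the one to use.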

I realise that I have changed the ground rules somewhat and made some assumptions about how the hash was being generated, but often the best way to optimise a slow process is to stand back and look not at where the slowness is manifested, but instead look at where the roots of the slowness are generated.

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Hooray!