I think the main problem starts before the code in the snippets is reached.

The way the start and end positions are stored is very strange. Why put them into a hash that way? The only way I can imagine this being done, assuming they are found by a regex search, is something like this:

    my %endpoints;
    $endpoints{ $-[0] } = $+[0] while $chromosome =~ m[(CG.*?AT)]g;

This could easily (and efficiently) be changed to

    my %endpoints;
    # @- and @+ hold 0-based offsets, so store start => length with no adjustment
    $endpoints{ $-[0] } = $+[0] - $-[0] while $chromosome =~ m[(CG.*?AT)]g;
    ...
    while( my( $begin, $length ) = each %endpoints ) {
        print substr( $chromosome, $begin, $length ), "\n\n";
    }

thus removing the need to call a user subroutine at all without loss of clarity.
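As a quick check of the offsets involved, here is a minimal sketch using a made-up $chromosome string (an assumption for illustration, not data from the thread). It relies on the documented behaviour that $-[0] is the 0-based offset where the last match starts and $+[0] is the offset just past its end, so substr with a start of $-[0] and a length of $+[0] - $-[0] should reproduce the match.

```perl
use strict;
use warnings;

# Hypothetical sample data; the real chromosome string is not shown in the thread.
my $chromosome = 'xxCGaaATyyCGbbbATzz';

my @spans;    # [ start offset, length ] pairs, in match order
while ( $chromosome =~ m[(CG.*?AT)]g ) {
    my ( $start, $length ) = ( $-[0], $+[0] - $-[0] );
    push @spans, [ $start, $length ];

    # substr with these values yields the same text the regex captured in $1
    print "$start+$length: ", substr( $chromosome, $start, $length ), " ($1)\n";
}
```

Storing start and length rather than start and end means the values can be handed straight to substr when printing.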

That said, I still think that storing the position information this way is a bad idea: it means the exons are printed out in an effectively random order, and the only correction is to build two lists of the keys in order to sort them prior to printing.
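To illustrate the ordering point, a small sketch with a hypothetical sample string: the hash itself carries no order, so the start offsets have to be pulled out and sorted numerically before the exons can be printed in sequence.

```perl
use strict;
use warnings;

my $chromosome = 'xxCGaaATyyCGbbbATzz';    # hypothetical sample data

my %endpoints;    # start offset => length
$endpoints{ $-[0] } = $+[0] - $-[0] while $chromosome =~ m[(CG.*?AT)]g;

# keys() and each() return entries in no useful order, so the offsets
# must be sorted numerically to recover the order along the chromosome.
my @ordered = sort { $a <=> $b } keys %endpoints;
print substr( $chromosome, $_, $endpoints{$_} ), "\n" for @ordered;
```

With many exons, both the key list and its sorted copy exist in memory at once, which is the cost being objected to.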

I think a better way would be to build an array of LVALUE refs to the exons.

    my @exons;
    # @- and @+ are 0-based offsets, so start and length need no adjustment
    push @exons, eval "\\substr( \$chromosome, $-[0], $+[0] - $-[0] )"
        while $chromosome =~ m[(CG.*?AT)]g;
    ...
    for( @exons ) {
        print $$_, "\n\n";
    }

This builds an array of LVALUE refs directly into $chromosome. The array uses much less memory than the equivalent hash, each element being just an SV plus a little magic.

The for loop efficiently processes the exons in their original order, without the need to sort, without creating any long lists (thanks to for's iterator magic), without any copying of strings, and without the need to call a user sub to achieve clarity.
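For comparison, a minimal sketch of the same idea without the string eval, assuming a reasonably recent Perl in which each \substr(...) yields its own reference. The sample string is hypothetical.

```perl
use strict;
use warnings;

my $chromosome = 'xxCGaaATyyCGbbbATzz';    # hypothetical sample data

my @exons;    # references to substr lvalues, in match order
push @exons, \substr( $chromosome, $-[0], $+[0] - $-[0] )
    while $chromosome =~ m[(CG.*?AT)]g;

print ref( $exons[0] ), "\n";    # the reference type of a substr lvalue
print $$_, "\n" for @exons;      # dereferencing reads straight from $chromosome
```

Because the array is filled in match order, the loop needs no sort step, and each element refers to a region of $chromosome rather than holding a copy of it.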

The need to use eval will slow things down when building the list, but that will probably be offset by avoiding hashing, and the eval is only needed because of a bug. That bug was fixed in 5.8.1, so that expense would be avoided there as well.

I realise that I have changed the ground rules somewhat and made some assumptions about how the hash was being generated, but often the best way to optimise a slow process is to stand back and look not at where the slowness is manifested, but instead look at where the roots of the slowness are generated.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Hooray!


In reply to Re: Large data processing ... by BrowserUk
in thread Large data processing ... by dragonchild
