knirirr has asked for the wisdom of the Perl Monks concerning the following question:

Whilst attempting to convert some subroutines into module, I've run into difficulty with one particular feature. @nreg in the code below is used to store the start and end locations of features in a nucleotide sequence, so that another method may be used to determine whether a given base is within any of the defined regions.
sub addregion { my ($start,$end) = @_; if ($start > $end) { ($start,$end) = ($end,$start); } my $done = 0; my $k; my $tempcoding = 0; for ($k=0;$k<@nreg;$k++) { # calculate sum of bases $tempcoding += $nreg[$k][1] - $nreg[$k][0] + 1; # new region starts after old region ends next if ($nreg[$k][1] < $start); # new region ends before old region starts if ($end < $nreg[$k][0]) { splice(@nreg,$k,0,[$start,$end]); $done = 1; last; } # regions overlap $nreg[$k][0] = $start < $nreg[$k][0] ? $start : $nreg[$k][0]; $nreg[$k][1] = $end > $nreg[$k][1] ? $end : $nreg[$k][1]; $done = 1; # Combine two regions if the new section has brough about an o +verlap while ($k < $#nreg and $nreg[$k][1] > $nreg[$k+1][0]) { $nreg[$k][1] = $nreg[$k+1][1]; splice(@nreg,$k+1,1); } last; } unless ($done) { push @nreg,[$start,$end]; } }
Several (hundreds, even) sequences may be read, and it's apparently necessary to define a new object relating @nreg to the name of each sequence, as I find that the values for all sequences are pushed into the same array. Can anyone suggest a suitable way to set up a "new" method to relate a different @nreg to a particular sequence name, please? I've tried various permutations, which either use the same array every time or refuse to bless it (e.g. "Attempt to bless into a reference at..." messages appear). Thanks for any suggestions!

Replies are listed 'Best First'.
Re: Array of arrays used within object
by esh (Pilgrim) on Aug 06, 2003 at 16:56 UTC
    I'm not sure I understand exactly what you mean by "convert subroutines into modules". It looks like one problem you have is the fact that the subroutine "addregion" is accessing the variable @nreg which has a scope wider than just the subroutine (e.g., a global variable). You might change the code to pass this array in, probably by reference.
    sub addregion { my ($start,$end,$nreg) = @_; : for ($k=0;$k<@$nreg;$k++) :
    You ask for a way to "relate a different @nreg to a particular sequence name". This might be done by turning @nreg from an array into a hash of arrays, where the hash key is the sequence name.
    sub addregion { my ($start,$end,$sequence) = @_; : for ($k=0;$k<@{$nreg{$sequence}};$k++) : $tempcoding += $nreg{$sequence}[$k][1] - $nreg{$sequence}[$k][0] + 1;
    My primary suggestion would be to try simplify the subroutine so that you pass in all parameters that it uses to operate. This reduces the external interactions and side effects and makes the entire code base simpler to deal with when you have to make changes.

    It's a bit difficult to offer suggestions without seeing the use of nreg in the rest of the code, and the "bless" error doesn't seem to relate to any of the code presented so that confuses me. Is the code sample you presented the original code or the code after you have started the conversion?

    Hmmm... After reading the post again, I'm wondering if maybe @nreg has one entry per sequence already, and you're trying to invert the structure of the data to have each sequence be a top leve object with the sub-array of @nreg[$k] be stored in the object.

Re: Array of arrays used within object
by MidLifeXis (Monsignor) on Aug 06, 2003 at 17:08 UTC

    Is each sequence an object? If so, make addregion() a method, and have it store \@nreg on the object as well.

    Otherwise, what about using something like this (pseudocode)...

    my %nregs; foreach <$sequences> { local @nreg; do_addregion_for_all_regions(); $nregs{getseqname($_)} = \@nreg; }

    This way, you have a set of @nreg, one for each sequence, keyed on the sequence name.

    If I am way off base here, no problem. It has been a while since I have done anything biology.

      Apologies for the confusion, and thanks for the suggestions. I'll attempt to clarify.

      >I'm not sure I understand exactly what you mean
      >by "convert subroutines into modules".

      >the variable @nreg which has a scope wider
      >than just the subroutine (e.g., a global variable).

      Originally, I had two subroutines, "addregion" and "isitin". The former is as I posted, and the latter looked like this:
      sub isitin { my ($hit,$what) = @_; my $within = 0; my $k; for ($k=0; $k<@nreg; $k++) { if ($what >= $nreg[$k][0] and $what <= $nreg[$k][1]) { $within = 1; } } return $within; }
      ...so that I could get a "1" back if a given base was within any of the regions defined with nreg. As this was all within one script, it was easy to use @nreg globally and re-set it at the top of a "while while (my $seq = $filein->next_seq()) " loop.
      However, I'd like to re-use the code by creating an object for each sequence, that would contain the sequence id and the @nreg array. The hard bit is working out how to set up that object - hence all manner of bless errors as I attempt to comprehend the complexities of OO (not easy for an old shell script writer ;-).

      BTW the code above is lifted directly from the original subroutines - hopefully I can keep the changes minimal and add some extra methods. It came about to solve this problem.

      Essentially, I looped through each sequence, called "addregion" on each feature, then "isitin" on related features later in the loop. @nreg was re-set for the next genome. I can't find any code in the bioperl that I'm using that does this job, hence the attempt to write my own object for it.
        Quick and easy optimization (not considering the other changes you are asking about): Replace
        $within = 1;
        with
        return 1;
        From your coded logic, there is no reason to continue looking through the rest of the array if you have found a match. You can also drop the declaration of $within and end the sub with
        return 0;
        since no match was found if it finishes the loop.

        In the interests of getting some data, I now have a crude substitute for the ideal code (I'll work on that when deadlines are less pressing). It's like this:

        package msatminer; sub zap { @nreg = (); } # rest of package essentially the same as posted above...

        Thus, calling a “msatminer::zap” before opening each genome file, then running the other methods on the starts and stops, and saving the results before the next “zap” I can at least get valid results out.

        However, if this is somehow very, very naughty in a way I haven't realised, then please tell me ;-)

        Thanks.