in reply to in need of wisdom: handling DNA strings

You are actually very close on this one ... While there are a few errors in the code snippet provided, you are on the right track with the use of substr function.

The key to the solution for this problem (as far as my understanding of the problem goes) is to make use of the fourth argument of the substr function on scalar strings which allows for replacement strings to be inserted - This would allow replacement strings in the DNA strands to be inserted as follows:

substr( $dna, $start + 1, $finish - $start, "X" x ( $finish - $start ) + );

The major problem which you appear to be facing with the existing code is that the $finish is not being defined at any point in your code. Note that the variables @finish and $finish are not equivalent and that scoping issues also come into effect with the code provided - It would be worthwhile your reading of Coping with Scoping written by Dominus.

The following is how I would rewrite the code snippet provided - Note that the following is untested and is provided for illustrative purposes:

my $dna; open DNA, "<dna.file" or die $!; foreach (<DNA>) { chomp; $dna .= $_; } close DNA; open POS, "<positions.input" or die $!; foreach (<POS>) { chomp; my ($start, $finish) = split /\s+/; substr( $dna, $start + 1, $finish - $start, "X" x ( $finish - $sta +rt ) ); print $dna, "\n"; } close POS;

This code snippet differs slightly from that provided in a few ways - Firstly, the manner of populating the $dna string differs in that the above example code populates the DNA scalar string by reading each line of the supplied file, chomping new-line characters and concatenating the input onto the $dna string rather than slurping it into an array first. This code then steps through the positions input file and splices DNA scalar accordingly, replacing introns/extrons with 'X'-strings of identical length using the substr function. In this fashion, the start and finish bases for splicing do not need to be stored beyond their use in the iterative loop.

The snippet of code provided raises a few other questions to my mind, specifically relating to the definition and usage of other variables including $count and $base - Are these variables used elsewhere in your code beyond the bounds of the snippet provided? If not then it appears that these variables are entirely unnecessary in the code snippet provided.

 

Replies are listed 'Best First'.
print $dna each time?
by RMGir (Prior) on Apr 23, 2002 at 13:41 UTC
    Nice fixes, rob_au. But do you really want to print $dna each time thru the loop? I know the original code did that, too, but I think that might be another buglet.

    I think you'd only want to do it once, after the loop.

    Also, the original code did

    $dna =~ s/(.{60})/$1\n/g; print "$dna\n"; print OUTFILE "$dna\n";
    which would wrap the lines. $dna is also output to OUTFILE, as you can see.
    --
    Mike
      Wholly agreed Mike - The printing of the $dna variable with each loop iteration may be an unnecessary remnant from earlier script development - This functionality was incorporated in the code snippet I provided purely for comparitive purposes between code snippets.

      The other aspects of output wrapping and printing to an output file were omitted for clarity of the code and functionality in question.

      Thanks for your comments and feedback.