lolly has asked for the wisdom of the Perl Monks concerning the following question:

Hi, i really need your help. I have one file containing a long string of DNA, and another file containing the start and stop positions of genes in the DNA. like this :
1 1800 1900 2254
etc. in which the first number is the start position and the second is the stop position. I need to map these positions to the DNA string and replace them with X's. I dont know whether I should be treating the start/stop positons file as a hash or array?? - there are lots of values so i couldn't declare them if i used a hash. Error messages say that $finish isn't declared within the second foreach loop. Heres the code so far:
#! /usr/local/bin/perl -w use strict; open (INPUT, "<positions.input") or die "unable to open file"; open (INPUT2, "<dna.file") or die "unable to open file"; open (OUTFILE, ">$$.output"); @dna = <INPUT2>; $dna = join ('', @dna); $dna =~ s/\s//g; @dna = split ('', $dna); $count = 0; $base = ("A"| "C"| "G" | "T"); foreach $base (@dna) { ++$count; } while (<INPUT>) { $line = $_; chomp ($line); @orfs = (); @orfs = split (/\s+/, $line); my @start = $orfs[0]; my @finish = $orfs[1]; } foreach $start (@orfs) { $start = substr ($dna, 0, $start); substr ($dna, $start +1, $finish - $start) =~ s/A-Z/X/g; $dna =~ s/(.{60})/$1\n/g; print "$dna\n"; print OUTFILE "$dna\n"; } close INPUT; close INPUT2; close OUTFILE; exit;
this is my attempt, i realise it is bad and can't think of a good way to do this. THANX. lol

Edit kudra, 2002-04-24 Added to title

Replies are listed 'Best First'.
Re: in need of wisdom
by rob_au (Abbot) on Apr 23, 2002 at 13:16 UTC
    You are actually very close on this one ... While there are a few errors in the code snippet provided, you are on the right track with the use of substr function.

    The key to the solution for this problem (as far as my understanding of the problem goes) is to make use of the fourth argument of the substr function on scalar strings which allows for replacement strings to be inserted - This would allow replacement strings in the DNA strands to be inserted as follows:

    substr( $dna, $start + 1, $finish - $start, "X" x ( $finish - $start ) + );

    The major problem which you appear to be facing with the existing code is that the $finish is not being defined at any point in your code. Note that the variables @finish and $finish are not equivalent and that scoping issues also come into effect with the code provided - It would be worthwhile your reading of Coping with Scoping written by Dominus.

    The following is how I would rewrite the code snippet provided - Note that the following is untested and is provided for illustrative purposes:

    my $dna; open DNA, "<dna.file" or die $!; foreach (<DNA>) { chomp; $dna .= $_; } close DNA; open POS, "<positions.input" or die $!; foreach (<POS>) { chomp; my ($start, $finish) = split /\s+/; substr( $dna, $start + 1, $finish - $start, "X" x ( $finish - $sta +rt ) ); print $dna, "\n"; } close POS;

    This code snippet differs slightly from that provided in a few ways - Firstly, the manner of populating the $dna string differs in that the above example code populates the DNA scalar string by reading each line of the supplied file, chomping new-line characters and concatenating the input onto the $dna string rather than slurping it into an array first. This code then steps through the positions input file and splices DNA scalar accordingly, replacing introns/extrons with 'X'-strings of identical length using the substr function. In this fashion, the start and finish bases for splicing do not need to be stored beyond their use in the iterative loop.

    The snippet of code provided raises a few other questions to my mind, specifically relating to the definition and usage of other variables including $count and $base - Are these variables used elsewhere in your code beyond the bounds of the snippet provided? If not then it appears that these variables are entirely unnecessary in the code snippet provided.

     

      Nice fixes, rob_au. But do you really want to print $dna each time thru the loop? I know the original code did that, too, but I think that might be another buglet.

      I think you'd only want to do it once, after the loop.

      Also, the original code did

      $dna =~ s/(.{60})/$1\n/g; print "$dna\n"; print OUTFILE "$dna\n";
      which would wrap the lines. $dna is also output to OUTFILE, as you can see.
      --
      Mike
        Wholly agreed Mike - The printing of the $dna variable with each loop iteration may be an unnecessary remnant from earlier script development - This functionality was incorporated in the code snippet I provided purely for comparitive purposes between code snippets.

        The other aspects of output wrapping and printing to an output file were omitted for clarity of the code and functionality in question.

        Thanks for your comments and feedback.

         

Re: in need of wisdom
by Zaxo (Archbishop) on Apr 23, 2002 at 14:31 UTC

    Your difficulty may have been from not declaring variables with my. When you use strict; you have to be strict. In any case, substr would be effective for this. Assuming it's sane to have the DNA file all in memory:

    #!/usr/local/bin/perl -w use strict; open SEQ, "< dna.file" or die $!; my @dna = <SEQ>; close SEQ or die $!; chomp @dna; my $dna = join '', @dna; @dna = map {length} @dna; open LOCS, "< positions.input" or die $!; for (<LOCS>) { # I don't know how your data is counted, watch for fencepost error +s my ($start, $end) = split " "; substr( $dna, $start, $end - $start) =~ tr/CGAT/X/; } close LOCS or die $!; { local $\ = $/; open OUTFILE, "> $$.output" or die $!; # print OUTFILE for map { substr $dna, 0, $_, '' } @dna; print OUTFILE substr( $dna, 0, $_, '') for @dna; close OUTFILE or die $!; }
    There are several tidbits here for a student of perl. Untested but It Ought To Work(TM). It might be worthwhile to make this do without hardcoded file names.

    Update: I like map too much. Modified output loop to save memory.

    After Compline,
    Zaxo