K_Edw has asked for the wisdom of the Perl Monks concerning the following question:

I have a set of strings interspersed with header-lines marked with '>' - specifically, a .FASTA file containing multiple DNA sequences which are read into a hash using bioPerl (Bio::SeqIO). A short example is:

>I CATCAGTATAAAATGACTAGTAGCTAGATACCACAGATACGATACAACA >II TACCACAGATACGATACAACACATCAGTATAAAATGACTAGTAGCAGAC

Additionally, I have a second input file which specifies changes to any given DNA sequence, in this format: (shortened example and additional columns present but now shown)

I 2 A G I 4 C T I 5 A G I 7 T C II 1 T C II 2 A G II 3 C T II 5 A C II 8 G T II 10 T G
$F0 = DNA sequence the changes should be made to $F1 = Letter position that is involved in the change $F2 = Pre-existing letter $F3 = Replacement letter

I'm trying to write a simple script that will iterate through each DNA sequence, load up all the changes pertaining to it and make the substring replacements.

For example DNA sequence I will go from:

CATCAGTATAAAATGACTAGTAGCTAGATACCACAGATACGATACAACA

To:

CGTTGGCATAAAATGACTAGTAGCTAGATACCACAGATACGATACAACA

I'm very new to Perl and have run into some problems. This is my script so far:

#!/usr/bin/env perl use strict; use warnings; use Bio::SeqIO; #Read and store .FASTA file my %sequences; my $seqio = Bio::SeqIO->new(-file => $ARGV[0]); while(my $seqobj = $seqio->next_seq) { my $id = $seqobj->display_id; my $seq = $seqobj->seq; $sequences{$id} = $seq; } #Variant Input File & Output my $variants = $ARGV[1]; open IN, '<', $variants or die "$!"; open OUT, '>', 'output.txt' or die "$!"; #Loops my $iteration = 0; foreach my $key (sort keys%sequences) { $iteration ++; my $value = $sequences{$key}; while (<IN>) { my (@F) = split("\t", $_); if ($F[0] eq $key) { substr($value,0,$F[1]) = $F[3]; } else { last } } print OUT ">$key\n$value" if $iteration=~1; print OUT "\n>$key\n$value" if $iteration>1; }
The output is basically a copy of the input - the changes have not been made to each sequence but I am not sure why. I suspect the while loop is not structured correctly or my use of 'last' is incorrect (i'm trying to break out of the while loop to move onto the next DNA sequence). I'm sure there is a better approach. Any help would be much appreciated, especially updates/critique of the above code and pointing out where it is wrong as I plan to eventually include more complex changes (such as removal of letters instead of replacement) and feel this is a good way to access each hash value in-turn (but I may be wrong).

Updates made - working, but inefficient:

substr($value,$F[1]-1,1) = $F[3]; use 'next' instead of 'last' added seek(IN,0,0);

Replies are listed 'Best First'.
Re: Replacing substrings within hash values
by BrowserUk (Patriarch) on Mar 24, 2016 at 10:11 UTC

    Perhaps this will help? (Sorry, I'm too lazy to use Bio::*):

    #! perl -slw use strict; use Inline::Files; use Data::Dump qw[ pp ]; my %seqs = map{ split "\n", $_ } <FASTA>; pp \%seqs; while( my( $seq, $pos, $chr, $rep ) = split ' ', <EDITS> ) { substr( $seqs{ '>' . $seq }, $pos-1, 1, $rep ); } pp \%seqs; __FASTA__ >I CATCAGTATAAAATGACTAGTAGCTAGATACCACAGATACGATACAACA >II TACCACAGATACGATACAACACATCAGTATAAAATGACTAGTAGCAGAC __EDITS__ I 2 A G I 4 C T I 5 A G I 7 T C II 1 T C II 2 A G II 3 C T II 5 A C II 8 G T II 10 T G

    Output:

    C:\test>1158701 { ">I" => "CATCAGTATAAAATGACTAGTAGCTAGATACCACAGATACGATACAACA", ">II" => "TACCACAGATACGATACAACACATCAGTATAAAATGACTAGTAGCAGAC", } { ">I" => "CGTTGGCATAAAATGACTAGTAGCTAGATACCACAGATACGATACAACA", ">II" => "CGTCCCATAGACGATACAACACATCAGTATAAAATGACTAGTAGCAGAC", }

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thanks! I'll have a go at understanding this. I've not come across 'map' before. It would also be good if somebody could help me understand what's wrong with my own code (for learning purposes), and also because I may wish to introduce more ways of editing such as deleting certain letters and thus needing to keep track of changes as it goes to ensure replacements still occur in the right place.

        The main problem with your code is that you must read the entire IN file for each sequence. This would work if you rewind (seek IN, 0, 0) the IN file at the end of the sequence loop. This approach is very inefficient. Run-time would become unacceptable for larger files.

        Your use of last is fine for sequence I. For all other sequences, it would exit the loop before it gets to the processing. Use next instead.

        Bill
Re: Replacing substrings within hash values
by Anonymous Monk on Mar 24, 2016 at 13:39 UTC
    foreach my $key (sort keys%sequences)
    Are I and II Roman numerals? Are they supposed to be in that order (I, II, III, IV etc)? Perl doesn't speak Latin, so it, for example, will sort IX before VI.

    Also, consider:

    my $str = "abcde"; print substr( $str, 0, 3 );
    That prints abc. Just reread the doc of substr.

      They are indeed roman numerals - they can easily be substituted once i've got a general idea of how the script should work.

      I see that i've misunderstood substr. So it should be:

      substr($value,$F[1]-1,1) = $F[3]

      I think.