Replacing substrings within hash values

K_Edw has asked for the wisdom of the Perl Monks concerning the following question:

I have a set of strings interspersed with header-lines marked with '>' - specifically, a .FASTA file containing multiple DNA sequences which are read into a hash using bioPerl (Bio::SeqIO). A short example is:

>I
CATCAGTATAAAATGACTAGTAGCTAGATACCACAGATACGATACAACA
>II
TACCACAGATACGATACAACACATCAGTATAAAATGACTAGTAGCAGAC
[download]

Additionally, I have a second input file which specifies changes to any given DNA sequence, in this format: (shortened example and additional columns present but now shown)

I    2    A    G
I    4    C    T
I    5    A    G
I    7    T    C
II    1    T    C
II    2    A    G
II    3    C    T
II    5    A    C
II    8    G    T
II    10    T    G
[download]

$F0 = DNA sequence the changes should be made to
$F1 = Letter position that is involved in the change
$F2 = Pre-existing letter
$F3 = Replacement letter
[download]

I'm trying to write a simple script that will iterate through each DNA sequence, load up all the changes pertaining to it and make the substring replacements.

For example DNA sequence I will go from:

CATCAGTATAAAATGACTAGTAGCTAGATACCACAGATACGATACAACA

To:

CGTTGGCATAAAATGACTAGTAGCTAGATACCACAGATACGATACAACA

I'm very new to Perl and have run into some problems. This is my script so far:

#!/usr/bin/env perl
use strict;
use warnings;
use Bio::SeqIO;

#Read and store .FASTA file
my %sequences;
my $seqio = Bio::SeqIO->new(-file => $ARGV[0]);
while(my $seqobj = $seqio->next_seq) {
    my $id  = $seqobj->display_id;
    my $seq = $seqobj->seq;
    $sequences{$id} = $seq;
}

#Variant Input File & Output
my $variants = $ARGV[1];
open IN, '<', $variants or die "$!";
open OUT, '>', 'output.txt' or die "$!";

#Loops
my $iteration = 0;
foreach my $key (sort keys%sequences) {
    $iteration ++;
    my $value = $sequences{$key};
    while (<IN>) {
        my (@F) = split("\t", $_);
        if ($F[0] eq $key) {
            substr($value,0,$F[1]) = $F[3];
        } else {
            last 
        }
    }
    print OUT ">$key\n$value" if $iteration=~1;
    print OUT "\n>$key\n$value" if $iteration>1;
}
[download]

The output is basically a copy of the input - the changes have not been made to each sequence but I am not sure why. I suspect the while loop is not structured correctly or my use of 'last' is incorrect (i'm trying to break out of the while loop to move onto the next DNA sequence). I'm sure there is a better approach. Any help would be much appreciated, especially updates/critique of the above code and pointing out where it is wrong as I plan to eventually include more complex changes (such as removal of letters instead of replacement) and feel this is a good way to access each hash value in-turn (but I may be wrong). Updates made - working, but inefficient:

substr($value,$F[1]-1,1) = $F[3];
use 'next' instead of 'last'
added seek(IN,0,0);
[download]

Comment on Replacing substrings within hash values Select or Download Code

Replies are listed 'Best First'.
Re: Replacing substrings within hash values by BrowserUk (Patriarch) on Mar 24, 2016 at 10:11 UTC
Perhaps this will help? (Sorry, I'm too lazy to use Bio::*): `#! perl -slw use strict; use Inline::Files; use Data::Dump qw[ pp ]; my %seqs = map{ split "\n", $_ } <FASTA>; pp \%seqs; while( my( $seq, $pos, $chr, $rep ) = split ' ', <EDITS> ) { substr( $seqs{ '>' . $seq }, $pos-1, 1, $rep ); } pp \%seqs; __FASTA__ >I CATCAGTATAAAATGACTAGTAGCTAGATACCACAGATACGATACAACA >II TACCACAGATACGATACAACACATCAGTATAAAATGACTAGTAGCAGAC __EDITS__ I 2 A G I 4 C T I 5 A G I 7 T C II 1 T C II 2 A G II 3 C T II 5 A C II 8 G T II 10 T G` [download] Output: `C:\test>1158701 { ">I" => "CATCAGTATAAAATGACTAGTAGCTAGATACCACAGATACGATACAACA", ">II" => "TACCACAGATACGATACAACACATCAGTATAAAATGACTAGTAGCAGAC", } { ">I" => "CGTTGGCATAAAATGACTAGTAGCTAGATACCACAGATACGATACAACA", ">II" => "CGTCCCATAGACGATACAACACATCAGTATAAAATGACTAGTAGCAGAC", }` [download] With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :) In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]
Re^2: Replacing substrings within hash values by K_Edw (Beadle) on Mar 24, 2016 at 11:20 UTC
Thanks! I'll have a go at understanding this. I've not come across 'map' before. It would also be good if somebody could help me understand what's wrong with my own code (for learning purposes), and also because I may wish to introduce more ways of editing such as deleting certain letters and thus needing to keep track of changes as it goes to ensure replacements still occur in the right place.	[reply]
Re^3: Replacing substrings within hash values by BillKSmith (Monsignor) on Mar 24, 2016 at 13:27 UTC
The main problem with your code is that you must read the entire IN file for each sequence. This would work if you rewind (`seek IN, 0, 0`) the IN file at the end of the sequence loop. This approach is very inefficient. Run-time would become unacceptable for larger files. Your use of `last` is fine for sequence I. For all other sequences, it would exit the loop before it gets to the processing. Use `next` instead. Bill	[reply] [d/l] [select]
Re^4: Replacing substrings within hash values by K_Edw (Beadle) on Mar 24, 2016 at 13:57 UTC
Re^5: Replacing substrings within hash values by poj (Abbot) on Mar 24, 2016 at 14:20 UTC
Some notes below your chosen depth have not been shown here
Re^5: Replacing substrings within hash values by BillKSmith (Monsignor) on Mar 24, 2016 at 22:48 UTC
Some notes below your chosen depth have not been shown here
Re: Replacing substrings within hash values by Anonymous Monk on Mar 24, 2016 at 13:39 UTC
`foreach my $key (sort keys%sequences)` [download] Are I and II Roman numerals? Are they supposed to be in that order (I, II, III, IV etc)? Perl doesn't speak Latin, so it, for example, will sort IX before VI. Also, consider: `my $str = "abcde"; print substr( $str, 0, 3 );` [download] That prints `abc`. Just reread the doc of substr.	[reply] [d/l] [select]
Re^2: Replacing substrings within hash values by K_Edw (Beadle) on Mar 24, 2016 at 13:50 UTC
They are indeed roman numerals - they can easily be substituted once i've got a general idea of how the script should work. I see that i've misunderstood substr. So it should be: `substr($value,$F[1]-1,1) = $F[3]` I think.	[reply] [d/l]