Issues with substr and substituting

lecb has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, Please may I kindly ask for some assistance please? Please be gentle however.. I am not an expert coder and I am also incredible inefficient, but for now I have to code in a way that makes logical sense to me. I have tried my best for 5 days with this and now I'm at the point where I need some external help...

The input file is commented in the code since I can't upload it for you. Essentially what I am trying to do is change the last column of the input (a 9 letter sequence) by changing each letter to an alternative of 3 letters AGCT, then moving onto the next letter in the sequence.

I.e. for each character of the last column, if it matches A, then switch to G then print new line, switch to T then print new line, switch to C then print new line (whilst remaining sequence stays same).

Example 1: for >chr1 67093601 67093610 CTACAGGAA -> switch to GTACAGGAA, TTACAGGAA, ATACAGGAA.

Then: CTACAGGAA -> switch to CAACAGGAA, CGACAGGAA, CCACAGGAA etc.

Therefore, for each sequence there should be 27 new sequences.

From looking at the output, I think the issue lies in the "$sequence =~ s/$test/C/;" bit of code, but I don't know how to fix it... the $j seems to count along the string as I'd like it to, but sometimes the wrong letter in the sequence in changed, and sometimes it is not!

If anyone can help me, I will be very, very grateful.

BW LECB

#!/usr/bin/perl
use strict;
use Data::Dumper;
use Storable;


## 5_prime_ref.fasta => 
# >chr1   67093601        67093610        CTACAGGAA
# >chr1   67095418        67095427        TTTCAGAAA
# >chr1   67096318        67096327        TTTCAGCTC
# >chr1   67115461        67115470        TTCCAGGCA
# >chr1   67125906        67125915        CTGCAGATA
# >chr1   67127254        67127263        AAATAGGAA
# >chr1   67131224        67131233        TAATAGATT
# >chr1   67093601        67093610        CTACAGGAA
# >chr1   67096318        67096327        TTTCAGCTC
# >chr1   67103379        67103388        GTTTAGATG
# >chr1   67111641        67111650        CTTTAGAGC
# >chr1   67115461        67115470        TTCCAGGCA
# >chr1   67125906        67125915        CTGCAGATA
# >chr1   67127254        67127263        AAATAGGAA
# >chr1   67131224        67131233        TAATAGATT
# >chr1   67093601        67093610        CTACAGGAA
# >chr1   67096318        67096327        TTTCAGCTC
# >chr1   67103379        67103388        GTTTAGATG
# >chr1   67111641        67111650        CTTTAGAGC
# >chr1   67113753        67113762        ATATAGTGA
# >chr1   201324579       201324588       GGGTAAGGT


open my $fasta5, '<', '5_prime_ref.fasta';
chomp (my @fasta5 = <$fasta5>);
close $fasta5;



my ($altsequence1, $altsequence2, $altsequence3, @newfasta5, @newfasta
+3, $sequence, @splity, $test, $j, $genomicposition, $ref, $alt1, $alt
+2, $alt3, $refsequence);
## for 5 prime ## >chr1   201324579       201324588       GGGTAAGGT
foreach my $line (@fasta5) { 
## need to iterate 9 times through j... 
    $j = 0;
    while ($j < 9) {
    @splity = split ("\t", $line);
    my $ref = $splity[3];
    $sequence = $splity[3];
    $refsequence = $splity[3];
    $genomicposition = $splity[1];
    $genomicposition = $genomicposition + 1; #first nucleotide is coun
+ted as first column + 1, not as the left boundary
    $splity[0] = ">".$splity[0];
    
    ##if ($character $j in sequence =~ matches A) {
    if (substr ($sequence, $j, 1) =~ m/A/) {
    $ref = "A";
    $genomicposition = $genomicposition + $j;
    $test = substr ($sequence, $j, 1);
    $sequence =~ s/$test/C/;
    $altsequence1 = $sequence;
    $alt1 = "C";
    $test = substr ($sequence, $j, 1);
    $sequence =~ s/$test/G/;
    $altsequence2 = $sequence;
    $alt2 = "G";
    $test = substr ($sequence, $j, 1);
    $sequence =~ s/$test/T/;
    $altsequence3 = $sequence;
    $alt3 = "T";
        }

    elsif (substr ($sequence, $j, 1) =~ m/C/) {
    $test = substr ($sequence, $j, 1);
    $ref = "C";
    $genomicposition = $genomicposition + $j;
    $sequence =~ s/$test/A/;
    $altsequence1 = $sequence;
    $alt1 = "A";
    $test = substr ($sequence, $j, 1);
    $sequence =~ s/$test/G/;
    $altsequence2 = $sequence;
    $alt2 = "G";
    $test = substr ($sequence, $j, 1);
    $sequence =~ s/$test/T/;
    $altsequence3 = $sequence;
    $alt3 = "T";
    }
    
    elsif (substr ($sequence, $j, 1) =~ m/T/) {
    $test = substr ($sequence, $j, 1);
    $ref = "T";
    $genomicposition = $genomicposition + $j;
    $sequence =~ s/$test/A/;
    $altsequence1 = $sequence;
    $alt1 = "A";
    $test = substr ($sequence, $j, 1);
    $sequence =~ s/$test/G/;
    $altsequence2 = $sequence;
    $alt2 = "G";
    $test = substr ($sequence, $j, 1);
    $sequence =~ s/$test/C/;
    $altsequence3 = $sequence;
    $alt3 = "C";
    }
            
    else {      
    $ref = "G";
    $genomicposition = $genomicposition + $j;
    $test = substr ($sequence, $j, 1);
    $sequence =~ s/$test/A/;
    $altsequence1 = $sequence;
    $alt1 = "A";
    $test = substr ($sequence, $j, 1);
    $sequence =~ s/$test/T/;
    $altsequence2 = $sequence;
    $alt2 = "T";
    $test = substr ($sequence, $j, 1);
    $sequence =~ s/$test/C/;
    $altsequence3 = $sequence;
    $alt3 = "C";
            }
            
    
    ## print the new 3 lines        
    my $newfasta5alt1 = join ("\t", $refsequence, $splity[0], $splity[
+1], $splity[2], $altsequence1, $ref, $alt1, $genomicposition);
    my $newfasta5alt2 = join ("\t", $refsequence, $splity[0], $splity[
+1], $splity[2], $altsequence2, $ref, $alt2, $genomicposition);
    my $newfasta5alt3 = join ("\t", $refsequence, $splity[0], $splity[
+1], $splity[2], $altsequence3, $ref, $alt3, $genomicposition);
    
    my $newfasta5 = join ("\n", $newfasta5alt1, $altsequence1, $newfas
+ta5alt2, $altsequence2, $newfasta5alt3, $altsequence3); 
    push (@newfasta5, $newfasta5);
    $j++;}
next;
    }
    
    print Dumper @newfasta5;
[download]

Comment on Issues with substr and substituting Download Code

Replies are listed 'Best First'.
Re: Issues with substr and substituting by Corion (Patriarch) on Sep 27, 2016 at 16:05 UTC
You are substituting - why don't you use the four-argument version of substr directly, or the lvalue form? `substr($sequence, $j, 1) = 'T';` [download]	[reply] [d/l]
Re^2: Issues with substr and substituting by lecb (Acolyte) on Sep 27, 2016 at 16:22 UTC
Thank you kindly Corion, I didn't realise I could do that! Works a treat now!! Thank you again very much.	[reply]
Re: Issues with substr and substituting by tybalt89 (Monsignor) on Sep 27, 2016 at 16:26 UTC
#!/usr/bin/perl # http://perlmonks.org/?node_id=1172760 use strict; use warnings; while(<DATA>) { print; while( /(\w)(?=\w*$)/g ) { print "$`$_$'" for grep $_ ne $1, qw( A C G T ); } } __DATA__ >chr1 67093601 67093610 CTACAGGAA >chr1 67095418 67095427 TTTCAGAAA >chr1 67096318 67096327 TTTCAGCTC >chr1 67115461 67115470 TTCCAGGCA >chr1 67125906 67125915 CTGCAGATA >chr1 67127254 67127263 AAATAGGAA >chr1 67131224 67131233 TAATAGATT >chr1 67093601 67093610 CTACAGGAA >chr1 67096318 67096327 TTTCAGCTC >chr1 67103379 67103388 GTTTAGATG >chr1 67111641 67111650 CTTTAGAGC >chr1 67115461 67115470 TTCCAGGCA >chr1 67125906 67125915 CTGCAGATA >chr1 67127254 67127263 AAATAGGAA >chr1 67131224 67131233 TAATAGATT >chr1 67093601 67093610 CTACAGGAA >chr1 67096318 67096327 TTTCAGCTC >chr1 67103379 67103388 GTTTAGATG >chr1 67111641 67111650 CTTTAGAGC >chr1 67113753 67113762 ATATAGTGA >chr1 201324579 201324588 GGGTAAGGT [download]	[reply] [d/l]
Re: Issues with substr and substituting by johngg (Canon) on Sep 30, 2016 at 21:11 UTC
The subroutine in the code below will handle sequence strings of varying length and will also allow for placeholder letters in the sequence. It loops over each letter in the sequence and uses a translation hash and the four-argument form of substr in a map. use strict; use warnings; use feature qw{ say }; say q{-} x 9; say for @{ expand( q{AGCT} ) }; say q{-} x 9; say for @{ expand( q{AXCT} ) }; say q{-} x 9; say for @{ expand( q{CTACAGGAA} ) }; say q{-} x 9; sub expand { my $seq = shift; my %translate = ( A => [ qw{ G C T } ], G => [ qw{ A C T } ], C => [ qw{ A G T } ], T => [ qw{ A G C } ], ); my $raExpands = []; for my $posn ( 0 .. length( $seq ) - 1 ) { my $letter = substr $seq, $posn, 1; next unless $letter =~ m{[AGCT]}; my $copy = $seq; push @{ $raExpands }, map { substr $copy, $posn, 1, $_; $copy; } @{ $translate{ $letter } }; } return $raExpands; } [download] The output. Read more... (618 Bytes) I hope this is useful. Cheers, JohnGG	[reply] [d/l] [select]