lecb has asked for the wisdom of the Perl Monks concerning the following question:
Dear Monks, Please may I kindly ask for some assistance please? Please be gentle however.. I am not an expert coder and I am also incredible inefficient, but for now I have to code in a way that makes logical sense to me. I have tried my best for 5 days with this and now I'm at the point where I need some external help...
The input file is commented in the code since I can't upload it for you. Essentially what I am trying to do is change the last column of the input (a 9 letter sequence) by changing each letter to an alternative of 3 letters AGCT, then moving onto the next letter in the sequence.
I.e. for each character of the last column, if it matches A, then switch to G then print new line, switch to T then print new line, switch to C then print new line (whilst remaining sequence stays same).
Example 1: for >chr1 67093601 67093610 CTACAGGAA -> switch to GTACAGGAA, TTACAGGAA, ATACAGGAA.
Then: CTACAGGAA -> switch to CAACAGGAA, CGACAGGAA, CCACAGGAA etc.
Therefore, for each sequence there should be 27 new sequences.
From looking at the output, I think the issue lies in the "$sequence =~ s/$test/C/;" bit of code, but I don't know how to fix it... the $j seems to count along the string as I'd like it to, but sometimes the wrong letter in the sequence in changed, and sometimes it is not!
If anyone can help me, I will be very, very grateful.
BW LECB
#!/usr/bin/perl use strict; use Data::Dumper; use Storable; ## 5_prime_ref.fasta => # >chr1 67093601 67093610 CTACAGGAA # >chr1 67095418 67095427 TTTCAGAAA # >chr1 67096318 67096327 TTTCAGCTC # >chr1 67115461 67115470 TTCCAGGCA # >chr1 67125906 67125915 CTGCAGATA # >chr1 67127254 67127263 AAATAGGAA # >chr1 67131224 67131233 TAATAGATT # >chr1 67093601 67093610 CTACAGGAA # >chr1 67096318 67096327 TTTCAGCTC # >chr1 67103379 67103388 GTTTAGATG # >chr1 67111641 67111650 CTTTAGAGC # >chr1 67115461 67115470 TTCCAGGCA # >chr1 67125906 67125915 CTGCAGATA # >chr1 67127254 67127263 AAATAGGAA # >chr1 67131224 67131233 TAATAGATT # >chr1 67093601 67093610 CTACAGGAA # >chr1 67096318 67096327 TTTCAGCTC # >chr1 67103379 67103388 GTTTAGATG # >chr1 67111641 67111650 CTTTAGAGC # >chr1 67113753 67113762 ATATAGTGA # >chr1 201324579 201324588 GGGTAAGGT open my $fasta5, '<', '5_prime_ref.fasta'; chomp (my @fasta5 = <$fasta5>); close $fasta5; my ($altsequence1, $altsequence2, $altsequence3, @newfasta5, @newfasta +3, $sequence, @splity, $test, $j, $genomicposition, $ref, $alt1, $alt +2, $alt3, $refsequence); ## for 5 prime ## >chr1 201324579 201324588 GGGTAAGGT foreach my $line (@fasta5) { ## need to iterate 9 times through j... $j = 0; while ($j < 9) { @splity = split ("\t", $line); my $ref = $splity[3]; $sequence = $splity[3]; $refsequence = $splity[3]; $genomicposition = $splity[1]; $genomicposition = $genomicposition + 1; #first nucleotide is coun +ted as first column + 1, not as the left boundary $splity[0] = ">".$splity[0]; ##if ($character $j in sequence =~ matches A) { if (substr ($sequence, $j, 1) =~ m/A/) { $ref = "A"; $genomicposition = $genomicposition + $j; $test = substr ($sequence, $j, 1); $sequence =~ s/$test/C/; $altsequence1 = $sequence; $alt1 = "C"; $test = substr ($sequence, $j, 1); $sequence =~ s/$test/G/; $altsequence2 = $sequence; $alt2 = "G"; $test = substr ($sequence, $j, 1); $sequence =~ s/$test/T/; $altsequence3 = $sequence; $alt3 = "T"; } elsif (substr ($sequence, $j, 1) =~ m/C/) { $test = substr ($sequence, $j, 1); $ref = "C"; $genomicposition = $genomicposition + $j; $sequence =~ s/$test/A/; $altsequence1 = $sequence; $alt1 = "A"; $test = substr ($sequence, $j, 1); $sequence =~ s/$test/G/; $altsequence2 = $sequence; $alt2 = "G"; $test = substr ($sequence, $j, 1); $sequence =~ s/$test/T/; $altsequence3 = $sequence; $alt3 = "T"; } elsif (substr ($sequence, $j, 1) =~ m/T/) { $test = substr ($sequence, $j, 1); $ref = "T"; $genomicposition = $genomicposition + $j; $sequence =~ s/$test/A/; $altsequence1 = $sequence; $alt1 = "A"; $test = substr ($sequence, $j, 1); $sequence =~ s/$test/G/; $altsequence2 = $sequence; $alt2 = "G"; $test = substr ($sequence, $j, 1); $sequence =~ s/$test/C/; $altsequence3 = $sequence; $alt3 = "C"; } else { $ref = "G"; $genomicposition = $genomicposition + $j; $test = substr ($sequence, $j, 1); $sequence =~ s/$test/A/; $altsequence1 = $sequence; $alt1 = "A"; $test = substr ($sequence, $j, 1); $sequence =~ s/$test/T/; $altsequence2 = $sequence; $alt2 = "T"; $test = substr ($sequence, $j, 1); $sequence =~ s/$test/C/; $altsequence3 = $sequence; $alt3 = "C"; } ## print the new 3 lines my $newfasta5alt1 = join ("\t", $refsequence, $splity[0], $splity[ +1], $splity[2], $altsequence1, $ref, $alt1, $genomicposition); my $newfasta5alt2 = join ("\t", $refsequence, $splity[0], $splity[ +1], $splity[2], $altsequence2, $ref, $alt2, $genomicposition); my $newfasta5alt3 = join ("\t", $refsequence, $splity[0], $splity[ +1], $splity[2], $altsequence3, $ref, $alt3, $genomicposition); my $newfasta5 = join ("\n", $newfasta5alt1, $altsequence1, $newfas +ta5alt2, $altsequence2, $newfasta5alt3, $altsequence3); push (@newfasta5, $newfasta5); $j++;} next; } print Dumper @newfasta5;
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: Issues with substr and substituting
by Corion (Patriarch) on Sep 27, 2016 at 16:05 UTC | |
by lecb (Acolyte) on Sep 27, 2016 at 16:22 UTC | |
|
Re: Issues with substr and substituting
by tybalt89 (Monsignor) on Sep 27, 2016 at 16:26 UTC | |
|
Re: Issues with substr and substituting
by johngg (Canon) on Sep 30, 2016 at 21:11 UTC |