heidi has asked for the wisdom of the Perl Monks concerning the following question:

Dear all, I want to delete a particular character in a string of one file, based on the corresponding scores in a different file. To explain the query in detail......
file number 1 - seq.txt >s1 AGCTTTTCGGGCAAT >s2 GCTGCCCCCCATCTT >s3 TCGTAGCTGAAAATC file number 2 - num.txt >s1 23 43 45 65 76 54 3 34 54 65 7 45 56 87 56 >s2 23 43 23 45 65 45 76 78 34 8 12 32 65 23 25 >s3 12 23 34 45 56 54 43 32 65 43 12 34 75 76 45 things to be done. 1) The sequence file and number file contains same number of enteries with same ID. 2) The number of nucleotides in a sequence in as same as the score of each nucleotide in the num.txt file.that is for example in seq.txt file, >s1 has 15 nucleotides, and in num.txt, >s1 has 15 scores for each corresponding nucleotide in the other file. 3) first thing to be checked is, if there is a repetition in the the bases(nucleotides). for this we can check only seq.txt file.for example: >s1 has 4 "T" and 3 "G".....>s2 has 6 "C"...... and >s3 has 4 "A"..... 4) we have to consider it as repeats only if the bases are continously 3 or more times. for example, in >s1 there is a 2 times "A" near the end of the sequence.....in >s2 +there is a 2 times "T" at the end...these should not be taken as repeated bases. 5) Once we choose the repeated bases according to point number (step 3), we have to take the corresponding ID in the num.txt file. 6) Example: -open seq.txt and open num.txt -process >s1 in both the files. -positions 4-7 is the repeat... "TTTT" in this case. -the corresponding scores in position 4-7 is "65 76 54 3" -there fore its T T T T 65 76 54 3 -we have to check only the last base SCORE. -If it is less than 10, we have to delete the score and nucleotide. -hence the result sequence should be >s1 AGCTTTCGGGCAAT >s1 23 43 45 65 76 54 34 54 65 7 45 56 87 56 This has to be done for all the sequences in the file. >s3 wil not have any change coz, the last base of the repeat is more than 10. ----------------------------------------------------------- sooooooo, my SAMPLE RESULT FILE for the above input should look like: file number 1 - seq.txt >s1 AGCTTTCGGCAAT >s2 GCTGCCCCCATCTT >s3 TCGTAGCTGAAAATC file number 2 - num.txt >s1 23 43 45 65 76 54 34 54 65 45 56 87 56 >s2 23 43 23 45 65 45 76 78 34 12 32 65 23 25 >s3 12 23 34 45 56 54 43 32 65 43 12 34 75 76 45 ----------------------------------------------
Also, is it possible to find the position where we delete the character?
  • Comment on deleting a particular character and its coresponding score in 2 different files!!!
  • Download Code

Replies are listed 'Best First'.
Re: deleting a particular character and its coresponding score in 2 different files!!!
by ikegami (Patriarch) on Oct 14, 2008 at 05:10 UTC

    You've already asked this question.

    What progress have you made since adding use warnings;?

      yeah, i tried with use warnings and use strict, here is the program which i tried.
      #!/usr/bin/perl use strict; use warnings; open(FH1,"read.txt"); open(FH2,"qual.txt"); my @arr1=<FH1>; my @arr2=<FH2>; my $joi1=join(' ',@arr1); my $joi2=join(' ',@arr2); my @new=split('>',$joi1); my @numbers=split('>',$joi2); my @new; my @seqid; foreach(@new){ my($seq_id,$seq)=split(/\n/,$_); push(@alp,split(' ',$seq)); push(@seqid,$seq_id); } my @numbers; my @numid; foreach(@numbers){ my ($num_id,$numb)=split(/\n/,$_); push(@num,split(' ',$numb)) +; push(@numid,$num_id) } my @keep= grep {$_ < 1 || $num[$_]>=10 || $alp[$_-1] ne $alp[$_-0]} 0. +.$#num; print (join(' ',@alp[@keep]),"\n"); print (join(' ',@num[@keep]),"\n");
      the result which i got is
      Use of uninitialized value in split at sample.pl line 15,<FH2> line 6. Use of uninitialized value in split at sample.pl line 21,<FH2> line 6. Use of uninitialized value in string ne at sample.pl line 24, <FH2> li +ne 6. Use of uninitialized value in join or string at sample.pl line 25, <FH +2> line 6. TGACTTTTGCAAAGCTCGTA TGACTTTTGCAAAGCTCGTA TGACTTTTGCAAAGCTCGTA + 34 45 34 23 32 43 54 45 3 +4 12 23 45 54 65 34 23 54 42 34 45 34 23 32 43 54 45 34 12 23 45 54 6 +5 34 23 54 42 34 45 34 23 32 43 54 45 34 12 23 45 54 65 34 23 54 42
      Is this problem because i am using more than one string from 2 different files? in the above program, why are we not comparing the character repetition at all? In some cases, if the number is less than 10, but doesnt come into repeted continous characters, it should be removed. Please suggest what can be done. I am not very good at perl :-( Thanks

        No. It is because you are being silly. Some important lines of code are:

        my @new = split ('>', $joi1); ... my @new; .. foreach (@new) {

        Which of the two @new variables do you expect will be used by the for loop?

        Why didn't you show us the '"my" variable @new masks earlier declaration in same scope at ...' errors you received or the errors about @alp and @num? Have you actually run the code you've shown us?

        It helps us a lot to help you if you give use sample code we can actually run and if you show us the result you get and the result you would like from running the code. Oh, and the sample code and results should be short. For example cleaning up your code a little including fixing all stricture warnings and errors:

        use warnings; use strict; my $seq = <<DATA; >s1 AGCTTTTCGGGCAAT >s2 GCTGCCCCCCATCTT >s3 TCGTAGCTGAAAATC DATA my $num = <<DATA; >s1 23 43 45 65 76 54 3 34 54 65 7 45 56 87 56 >s2 23 43 23 45 65 45 76 78 34 8 12 32 65 23 25 >s3 12 23 34 45 56 54 43 32 65 43 12 34 75 76 45 DATA open FH1, '<', \$seq; open FH2, '<', \$num; my @arr1 = <FH1>; my @arr2 = <FH2>; my $joi1 = join (' ', @arr1); my $joi2 = join (' ', @arr2); my @new = grep length, split ('>', $joi1); my @numbers = grep length, split ('>', $joi2); my @seqid; my @alp; foreach (@new) { my ($seq_id, $seq) = split (/\n/, $_); push (@alp, grep /[ACGT]/, split ('', $seq)); push (@seqid, $seq_id); } my @num; my @numid; foreach (@numbers) { my ($num_id, $numb) = split (/\n/, $_); push (@num, split (' ', $numb)); push ( @numid, $num_id ); } my @keep = grep {$_ < 1 || $num[$_] >= 10 || $alp[$_ - 1] ne $alp[$_ - 0]} 0 +.. $#num; print (join (' ', @alp[@keep]), "\n"); print (join (' ', map {sprintf '%2d', $_} @num[@keep]), "\n");

        prints:

        A G C T T T C G G C A A T G C T G C C C C C A T + C T T T C G T A G C T G A A A A T C 23 43 45 65 76 54 34 54 65 45 56 87 56 23 43 23 45 65 45 76 78 34 12 3 +2 65 23 25 12 23 34 45 56 54 43 32 65 43 12 34 75 76 45

        Perl reduces RSI - it saves typing
        Use of uninitialized value in string ne at sample.pl line 24, <FH2> line 6.
        This refers to

        my @keep= grep {$_ < 1 || $num[$_]>=10 || $alp[$_-1] ne $alp[$_-0]} 0..$#num;

        This statement appears to be incorrectly constructed. I believe the first argument should be a regex, not a logical conditional.

        Revised: Please ignore. As this points out, I was wrong.
Re: deleting a particular character and its coresponding score in 2 different files!!!
by gone2015 (Deacon) on Oct 14, 2008 at 14:05 UTC

    The posting appears to have been updated considerably... and the specification has shifted. Plus ça change :-(

    OK. I give up. I really hope this is something useful and not just homework. The program below gives:

      >s1:  A  G  C  T  T  T  T  C  G  G  G  C  A  A  T
           23 43 45 65 76 54  3 34 54 65  7 45 56 87 56
        ->  A  G  C  T  T  T  C  G  G  C  A  A  T -- dropped 6 10
           23 43 45 65 76 54 34 54 65 45 56 87 56
      >s2:  G  C  T  G  C  C  C  C  C  C  A  T  C  T  T
           23 43 23 45 65 45 76 78 34  8 12 32 65 23 25
        ->  G  C  T  G  C  C  C  C  C  A  T  C  T  T -- dropped 9
           23 43 23 45 65 45 76 78 34 12 32 65 23 25
      >s3:  T  C  G  T  A  G  C  T  G  A  A  A  A  T  C
           12 23 34 45 56 54 43 32 65 43 12 34 75 76 45
        ->  T  C  G  T  A  G  C  T  G  A  A  A  A  T  C -- all kept
           12 23 34 45 56 54 43 32 65 43 12 34 75 76 45
    

      This one gives me in 2 different files..
      #!/usr/bin/perl open FILE,"seq.txt"; open OUT1,">seq.out"; open OUT2,">num.out"; $seq = "";$num = ""; while(<FILE>) { if($_ =~ /^>/) { push(@seq,$seq) if $seq; $seq=''; push(@seq,"\n$_"); } else { $_ =~ s/\s*$//; $seq = $seq.$_; } } push(@seq,$seq); close(FILE); open FILE,"num.txt"; while(<FILE>) { if($_ =~ /^>/) { push(@num,$num) if $num; $num=''; push(@num,"\n$_"); } else { $_ =~ s/\s*$/ /; $num = $num."$_"; } } push(@num,$num); close(FILE); while($_ = shift @seq) { my @marker; if( $_ =~ />/ ) { $shead = $_; $nhead = shift @num; $seq = shift @seq; $num = shift @num; #print $seq,"\n"; #print $num,"\n"; @s = split (//,$seq); @n = split (/ /,$num); $pre = "null"; $cnt = 1; $arrcnt = -1; foreach(@s) { if($_ eq $pre) { $cnt++; } else { if ($cnt >= 3 && $n[$arrcnt] < 10) { push(@marker,$arrcnt); } $cnt = 1; } $pre = $_; $arrcnt++; } $exp = 0; foreach(@marker) { $_ = $_ - $exp; splice(@n,$_,1); splice(@s,$_,1); $exp++; } $seq = join(" ",@s); $num = join(" ",@n); print OUT1 $shead; print OUT1 $seq,"\n"; print OUT2 $shead; print OUT2 $num,"\n"; } else { print "give proper input file"; exit; } }
      THANK U ALL FOR THE HELP...Am learning a lot :)
        well...thats "myself" on the above program... THANK U ALL FOR THE HELP...Am learning a lot :)