deleting a particular character and its coresponding score in 2 different files!!!

heidi has asked for the wisdom of the Perl Monks concerning the following question:

Dear all, I want to delete a particular character in a string of one file, based on the corresponding scores in a different file. To explain the query in detail......

file number 1 - seq.txt
>s1
AGCTTTTCGGGCAAT
>s2
GCTGCCCCCCATCTT
>s3
TCGTAGCTGAAAATC

file number 2 - num.txt
>s1
23 43 45 65 76 54 3 34 54 65 7 45 56 87 56 
>s2
23 43 23 45 65 45 76 78 34 8 12 32 65 23 25
>s3
12 23 34 45 56 54 43 32 65 43 12 34 75 76 45


things to be done.

1) The sequence file and number file contains 
same number of enteries with same ID.

2) The number of nucleotides in a sequence in 
as same as the score of each nucleotide in the 
num.txt file.that is for example in seq.txt file, 
>s1 has 15 nucleotides, and in num.txt, >s1 has
15 scores for each corresponding nucleotide in 
the other file.

3) first thing to be checked is, if there is a 
repetition in the the bases(nucleotides). for 
this we can check only seq.txt file.for example: 
>s1 has 4 "T"  and 3 "G".....>s2 has 6 "C"......
and >s3 has 4 "A"..... 

4) we have to consider it as repeats only if the 
bases are continously 3 or more times. for example, 
in >s1 there is a 2 times "A" near the end of the sequence.....in >s2 
+there is a 2 times "T" at the 
end...these should not be taken as repeated bases. 

5) Once we choose the repeated bases according to 
point number (step 3), we have to take the 
corresponding ID in the num.txt file.

6) Example: 
-open seq.txt and open num.txt

-process >s1 in both the files.

-positions 4-7 is the repeat... "TTTT" 
in this case.

-the corresponding scores in position 4-7 
is "65 76 54 3"

-there fore its T   T   T   T
                65  76  54  3

-we have to check only the last base SCORE.

-If it is less than 10, we have to delete the score and nucleotide.

-hence the result sequence should be 


>s1
AGCTTTCGGGCAAT

>s1
23 43 45 65 76 54 34 54 65 7 45 56 87 56 

This has to be done for all the sequences 
in the file. >s3 wil not have any change coz, 
the last base of the repeat is more than 10. 
-----------------------------------------------------------
sooooooo, my SAMPLE RESULT FILE for 
the above input should look like:

file number 1 - seq.txt
>s1
AGCTTTCGGCAAT
>s2
GCTGCCCCCATCTT
>s3
TCGTAGCTGAAAATC

file number 2 - num.txt
>s1
23 43 45 65 76 54 34 54 65 45 56 87 56 
>s2
23 43 23 45 65 45 76 78 34 12 32 65 23 25
>s3
12 23 34 45 56 54 43 32 65 43 12 34 75 76 45
----------------------------------------------
[download]

Also, is it possible to find the position where we delete the character?

Comment on deleting a particular character and its coresponding score in 2 different files!!! Download Code

Replies are listed 'Best First'.
Re: deleting a particular character and its coresponding score in 2 different files!!! by ikegami (Patriarch) on Oct 14, 2008 at 05:10 UTC
You've already asked this question. What progress have you made since adding `use warnings;`?	[reply] [d/l]
Re^2: deleting a particular character and its coresponding score in 2 different files!!! by heidi (Sexton) on Oct 14, 2008 at 08:26 UTC
yeah, i tried with use warnings and use strict, here is the program which i tried. #!/usr/bin/perl use strict; use warnings; open(FH1,"read.txt"); open(FH2,"qual.txt"); my @arr1=<FH1>; my @arr2=<FH2>; my $joi1=join(' ',@arr1); my $joi2=join(' ',@arr2); my @new=split('>',$joi1); my @numbers=split('>',$joi2); my @new; my @seqid; foreach(@new){ my($seq_id,$seq)=split(/\n/,$_); push(@alp,split(' ',$seq)); push(@seqid,$seq_id); } my @numbers; my @numid; foreach(@numbers){ my ($num_id,$numb)=split(/\n/,$_); push(@num,split(' ',$numb)) +; push(@numid,$num_id) } my @keep= grep {$_ < 1 \|\| $num[$_]>=10 \|\| $alp[$_-1] ne $alp[$_-0]} 0. +.$#num; print (join(' ',@alp[@keep]),"\n"); print (join(' ',@num[@keep]),"\n"); [download] the result which i got is Use of uninitialized value in split at sample.pl line 15,<FH2> line 6. Use of uninitialized value in split at sample.pl line 21,<FH2> line 6. Use of uninitialized value in string ne at sample.pl line 24, <FH2> li +ne 6. Use of uninitialized value in join or string at sample.pl line 25, <FH +2> line 6. TGACTTTTGCAAAGCTCGTA TGACTTTTGCAAAGCTCGTA TGACTTTTGCAAAGCTCGTA + 34 45 34 23 32 43 54 45 3 +4 12 23 45 54 65 34 23 54 42 34 45 34 23 32 43 54 45 34 12 23 45 54 6 +5 34 23 54 42 34 45 34 23 32 43 54 45 34 12 23 45 54 65 34 23 54 42 [download] Is this problem because i am using more than one string from 2 different files? in the above program, why are we not comparing the character repetition at all? In some cases, if the number is less than 10, but doesnt come into repeted continous characters, it should be removed. Please suggest what can be done. I am not very good at perl :-( Thanks	[reply] [d/l] [select]
Re^3: deleting a particular character and its coresponding score in 2 different files!!! by GrandFather (Saint) on Oct 14, 2008 at 10:26 UTC
No. It is because you are being silly. Some important lines of code are: `my @new = split ('>', $joi1); ... my @new; .. foreach (@new) {` [download] Which of the two @new variables do you expect will be used by the for loop? Why didn't you show us the '"my" variable @new masks earlier declaration in same scope at ...' errors you received or the errors about @alp and @num? Have you actually run the code you've shown us? It helps us a lot to help you if you give use sample code we can actually run and if you show us the result you get and the result you would like from running the code. Oh, and the sample code and results should be short. For example cleaning up your code a little including fixing all stricture warnings and errors: use warnings; use strict; my $seq = <<DATA; >s1 AGCTTTTCGGGCAAT >s2 GCTGCCCCCCATCTT >s3 TCGTAGCTGAAAATC DATA my $num = <<DATA; >s1 23 43 45 65 76 54 3 34 54 65 7 45 56 87 56 >s2 23 43 23 45 65 45 76 78 34 8 12 32 65 23 25 >s3 12 23 34 45 56 54 43 32 65 43 12 34 75 76 45 DATA open FH1, '<', \$seq; open FH2, '<', \$num; my @arr1 = <FH1>; my @arr2 = <FH2>; my $joi1 = join (' ', @arr1); my $joi2 = join (' ', @arr2); my @new = grep length, split ('>', $joi1); my @numbers = grep length, split ('>', $joi2); my @seqid; my @alp; foreach (@new) { my ($seq_id, $seq) = split (/\n/, $_); push (@alp, grep /[ACGT]/, split ('', $seq)); push (@seqid, $seq_id); } my @num; my @numid; foreach (@numbers) { my ($num_id, $numb) = split (/\n/, $_); push (@num, split (' ', $numb)); push ( @numid, $num_id ); } my @keep = grep {$_ < 1 \|\| $num[$_] >= 10 \|\| $alp[$_ - 1] ne $alp[$_ - 0]} 0 +.. $#num; print (join (' ', @alp[@keep]), "\n"); print (join (' ', map {sprintf '%2d', $_} @num[@keep]), "\n"); [download] prints: `A G C T T T C G G C A A T G C T G C C C C C A T + C T T T C G T A G C T G A A A A T C 23 43 45 65 76 54 34 54 65 45 56 87 56 23 43 23 45 65 45 76 78 34 12 3 +2 65 23 25 12 23 34 45 56 54 43 32 65 43 12 34 75 76 45` [download] Perl reduces RSI - it saves typing	[reply] [d/l] [select]
Re^3: deleting a particular character and its coresponding score in 2 different files!!! by apl (Monsignor) on Oct 14, 2008 at 10:00 UTC
Use of uninitialized value in string ne at sample.pl line 24, <FH2> line 6. This refers to `my @keep= grep {$_ < 1 \|\| $num[$_]>=10 \|\| $alp[$_-1] ne $alp[$_-0]} 0..$#num;` ~~This statement appears to be incorrectly constructed. I believe the first argument should be a regex, not a logical conditional.~~ Revised: Please ignore. As this points out, I was wrong.	[reply] [d/l]
Re: deleting a particular character and its coresponding score in 2 different files!!! by gone2015 (Deacon) on Oct 14, 2008 at 14:05 UTC
The posting appears to have been updated considerably... and the specification has shifted. Plus ça change :-( OK. I give up. I really hope this is something useful and not just homework. The program below gives: >s1: A G C T T T T C G G G C A A T 23 43 45 65 76 54 3 34 54 65 7 45 56 87 56 -> A G C T T T C G G C A A T -- dropped 6 10 23 43 45 65 76 54 34 54 65 45 56 87 56 >s2: G C T G C C C C C C A T C T T 23 43 23 45 65 45 76 78 34 8 12 32 65 23 25 -> G C T G C C C C C A T C T T -- dropped 9 23 43 23 45 65 45 76 78 34 12 32 65 23 25 >s3: T C G T A G C T G A A A A T C 12 23 34 45 56 54 43 32 65 43 12 34 75 76 45 -> T C G T A G C T G A A A A T C -- all kept 12 23 34 45 56 54 43 32 65 43 12 34 75 76 45 Read more... gift (5 kB)	[reply] [d/l]
Re^2: deleting a particular character and its coresponding score in 2 different files!!! by Anonymous Monk on Oct 15, 2008 at 06:21 UTC
This one gives me in 2 different files.. #!/usr/bin/perl open FILE,"seq.txt"; open OUT1,">seq.out"; open OUT2,">num.out"; $seq = "";$num = ""; while(<FILE>) { if($_ =~ /^>/) { push(@seq,$seq) if $seq; $seq=''; push(@seq,"\n$_"); } else { $_ =~ s/\s$//; $seq = $seq.$_; } } push(@seq,$seq); close(FILE); open FILE,"num.txt"; while(<FILE>) { if($_ =~ /^>/) { push(@num,$num) if $num; $num=''; push(@num,"\n$_"); } else { $_ =~ s/\s$/ /; $num = $num."$_"; } } push(@num,$num); close(FILE); while($_ = shift @seq) { my @marker; if( $_ =~ />/ ) { $shead = $_; $nhead = shift @num; $seq = shift @seq; $num = shift @num; #print $seq,"\n"; #print $num,"\n"; @s = split (//,$seq); @n = split (/ /,$num); $pre = "null"; $cnt = 1; $arrcnt = -1; foreach(@s) { if($_ eq $pre) { $cnt++; } else { if ($cnt >= 3 && $n[$arrcnt] < 10) { push(@marker,$arrcnt); } $cnt = 1; } $pre = $_; $arrcnt++; } $exp = 0; foreach(@marker) { $_ = $_ - $exp; splice(@n,$_,1); splice(@s,$_,1); $exp++; } $seq = join(" ",@s); $num = join(" ",@n); print OUT1 $shead; print OUT1 $seq,"\n"; print OUT2 $shead; print OUT2 $num,"\n"; } else { print "give proper input file"; exit; } } [download] THANK U ALL FOR THE HELP...Am learning a lot :)	[reply] [d/l]
Re^3: deleting a particular character and its coresponding score in 2 different files!!! by heidi (Sexton) on Oct 15, 2008 at 06:26 UTC
well...thats "myself" on the above program... THANK U ALL FOR THE HELP...Am learning a lot :)	[reply]