Re^2: Comparing array of aligned sequences

Replies are listed 'Best First'.
Re^3: Comparing array of aligned sequences by johngg (Canon) on Jul 11, 2014 at 21:19 UTC
Given the number of sequence lines in your files processing line by line would be the better option. Reading the first sequence into a string that we will compare with subsequent sequences one at a time, modifying the string as we go would be the approach I'd now adopt. The `flagDiffs()` subroutine takes copies of the base sequence string and the next sequence read from file as arguments and `XOR`s them. The resultant string will have `\0` (NULL) characters wherever characters in the two sequences matched. It then uses a regular expression and pos to find non-NULL characters, i.e. non-matches. Finally it modifies the base sequence string by substituting an 'X' at the positions where there were non-matches and returns it, the returned sttring being assigned back to the base sequence string. Here is a command-line example of XOR'ing two strings to demonstrate the process. `$ perl -E ' $s1 = q{gctgc}; $s2 = q{gctcc}; $x = $s1 ^ $s2; say for map sprintf( q{0x%02x}, ord ), split m{}, $x;' 0x00 0x00 0x00 0x04 0x00 $` [download] Once all lines have been processed the base sequence string can be split on one or more 'X' characters to find the consensus strings. use strict; use warnings; use 5.014; open my $inFH, q{<}, \ <<EOD or die $!; atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctgccgccctcttctccgcctgccgttccgg +c atagctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgtttggggctctgccgccctcttctccgcctgccgttcagg +c atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctgccgccctcttctccgcctgccgttccgg +c atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctaccgccctcttctccgcctgccgttccgg +c EOD my $baseStr = <$inFH>; chomp $baseStr; while ( <$inFH> ) { chomp; $baseStr = flagDiffs( $baseStr, $_ ); } close $inFH or die $!; say for split m{X+}, $baseStr; sub flagDiffs { my( $baseStr, $nextStr ) = @_; my $diff = $baseStr ^ $nextStr; my @posns; push @posns, pos $diff while $diff =~ m{(?=[^\0])}g; substr $baseStr, $_, 1, q{X} for @posns; return $baseStr; } [download] The output. `at gctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctgaat +cctgcggacgacccctctcgtggtcg ttggggctct ccgccctcttctccgcctgccgttc ggc` [download] Without a 25,000 sequence file to test on I don't know whether this approach will perform but it seems to give the expected results with the sample in your OP. I hope this is helpful. Update: Added the command-line XOR example. Update 2: There's no need to keep updating the base sequence string as we go, just keep a record, as keys of a hash, of where differences are found and modify it after all lines have been processed. Also there's no need to chomp each line (as long as all the sequences are the same length), just do it at the end to the base string. If sequences differ in length then you are into pre-processing to find the shortest or longest sequence then either truncating to the shortest or padding to the longest. New code, the output is the same. use strict; use warnings; use 5.014; open my $inFH, q{<}, \ <<EOD or die $!; atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctgccgccctcttctccgcctgccgttccgg +c atagctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgtttggggctctgccgccctcttctccgcctgccgttcagg +c atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctgccgccctcttctccgcctgccgttccgg +c atggctgctaggctgtgctgccaactggatcctgcgcgggacgtcctttgtctacgtcccgtcggcgctg +aatcctgcggacgacccctctcgtggtcgcttggggctctaccgccctcttctccgcctgccgttccgg +c EOD my $baseStr = <$inFH>; my %diffPosns; while ( <$inFH> ) { findDiffPosns( $baseStr, $_ ); } close $inFH or die $!; chomp $baseStr; substr $baseStr, $_, 1, q{X} for keys %diffPosns; say for split m{X+}, $baseStr; sub findDiffPosns { my( $baseStr, $nextStr ) = @_; my $diff = $baseStr ^ $nextStr; $diffPosns{ pos $diff } ++ while $diff =~ m{(?=[^\0])}g; } [download] Cheers, JohnGG	[reply] [d/l] [select]
Re^4: Comparing array of aligned sequences by newtoperlprog (Sexton) on Jul 14, 2014 at 21:26 UTC
Dear JohnGG, Thank you very much for your time and suggestions. It feels good to learn alternate ways to code for the same input. I have another question to ask regarding generating multiple strings from a general rule. I am trying to generate a string of 10 bases based on certain rule:C)2N3(A\|G)4-6N7T8(A\|T)9(A\|T)10 where N could be (A or T or G or C). So according to the above rule, 1st and 2nd position could be a combination of (GG or GC or CG or CC) followed by (A or T or G or C) at position 3 followed by a combination of (AAA or AAG or AGG .. and so on). The thing which I am not able to figure out is do I need to put lot of loops to do it or is there any short way using perl. Thank you for any help or suggestions.	[reply]
Re^5: Comparing array of aligned sequences by johngg (Canon) on Jul 14, 2014 at 22:27 UTC
The glob function is your friend :-) $ perl -Mstrict -Mwarnings -E ' my $rule = q{{c,g}} x 2; $rule .= q{{a,t,g,c}}; $rule .= q{{a,g}} x 3; $rule .= q{{a,t,g,c}}; $rule .= q{t}; $rule .= q{{a,t}} x 2; say $rule; say for glob $rule;' {c,g}{c,g}{a,t,g,c}{a,g}{a,g}{a,g}{a,t,g,c}t{a,t}{a,t} ccaaaaataa ccaaaaatat ccaaaaatta ccaaaaattt ccaaaattaa ccaaaattat ccaaaattta ccaaaatttt ccaaaagtaa ccaaaagtat ccaaaagtta ccaaaagttt ccaaaactaa ccaaaactat ... ggcgggatat ggcgggatta ggcgggattt ggcgggttaa ggcgggttat ggcgggttta ggcgggtttt ggcggggtaa ggcggggtat ggcggggtta ggcggggttt ggcgggctaa ggcgggctat ggcgggctta ggcgggcttt $ [download] Hopefully I've understood your rules correctly and this is helpful. Cheers, JohnGG	[reply] [d/l]
Re^6: Comparing array of aligned sequences by perlfan (Parson) on Jul 15, 2014 at 15:43 UTC