in reply to Re: Comparing array of aligned sequences
in thread Comparing array of aligned sequences
Dear JohnGG, Thank you for your reply and suggestions. The sample data which i provided only has 4 sequences whereas in reality i have files which have greater than 25,000 sequences. In that situation I guess I might have to use loops to do the checking. Kindly let me know if I am thinking in the right direction.
|
---|
Replies are listed 'Best First'. | |
---|---|
Re^3: Comparing array of aligned sequences
by johngg (Canon) on Jul 11, 2014 at 21:19 UTC | |
Given the number of sequence lines in your files processing line by line would be the better option. Reading the first sequence into a string that we will compare with subsequent sequences one at a time, modifying the string as we go would be the approach I'd now adopt. The flagDiffs() subroutine takes copies of the base sequence string and the next sequence read from file as arguments and XORs them. The resultant string will have \0 (NULL) characters wherever characters in the two sequences matched. It then uses a regular expression and pos to find non-NULL characters, i.e. non-matches. Finally it modifies the base sequence string by substituting an 'X' at the positions where there were non-matches and returns it, the returned sttring being assigned back to the base sequence string. Here is a command-line example of XOR'ing two strings to demonstrate the process.
Once all lines have been processed the base sequence string can be split on one or more 'X' characters to find the consensus strings.
The output.
Without a 25,000 sequence file to test on I don't know whether this approach will perform but it seems to give the expected results with the sample in your OP. I hope this is helpful. Update: Added the command-line XOR example. Update 2: There's no need to keep updating the base sequence string as we go, just keep a record, as keys of a hash, of where differences are found and modify it after all lines have been processed. Also there's no need to chomp each line (as long as all the sequences are the same length), just do it at the end to the base string. If sequences differ in length then you are into pre-processing to find the shortest or longest sequence then either truncating to the shortest or padding to the longest. New code, the output is the same.
Cheers, JohnGG | [reply] [d/l] [select] |
by newtoperlprog (Sexton) on Jul 14, 2014 at 21:26 UTC | |
Dear JohnGG, Thank you very much for your time and suggestions. It feels good to learn alternate ways to code for the same input. I have another question to ask regarding generating multiple strings from a general rule. I am trying to generate a string of 10 bases based on certain rule:C)2N3(A|G)4-6N7T8(A|T)9(A|T)10 where N could be (A or T or G or C). So according to the above rule, 1st and 2nd position could be a combination of (GG or GC or CG or CC) followed by (A or T or G or C) at position 3 followed by a combination of (AAA or AAG or AGG .. and so on). The thing which I am not able to figure out is do I need to put lot of loops to do it or is there any short way using perl. Thank you for any help or suggestions. | [reply] |
by johngg (Canon) on Jul 14, 2014 at 22:27 UTC | |
The glob function is your friend :-)
Hopefully I've understood your rules correctly and this is helpful. Cheers, JohnGG | [reply] [d/l] |
by perlfan (Parson) on Jul 15, 2014 at 15:43 UTC |