The bigger problem... I have a program that allows users to search text files in FASTA format for an arbitrary number of strings of arbitrary length (motifs of DNA nucleotides to be specific). These strings are typically 6-15 characters long.
I currently perform the searches with m//g and so on, after converting each user input string into a regexp, depending on whether the user entered only A, C, G, or T, or used the standard degeneracy codes, which allow things like specifying 'R' to mean either 'A' or 'G'. So, if the user inputs 'ART', the search pattern is actually "A[RAG]T", and so on. In this context, specifying 'N' at any position is equivalent to [ACGTN].
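For readers following along, here is a minimal sketch of that kind of conversion. It is not the actual program: the helper name motif_to_regex and the %IUPAC table layout are assumptions made for illustration, and each degenerate code is assumed to also match itself in the sequence (as with 'R' and 'N' above).

```perl
use strict;
use warnings;

# Standard IUPAC nucleotide codes mapped to the plain bases they cover.
my %IUPAC = (
    A => 'A',   C => 'C',   G => 'G',   T => 'T',
    R => 'AG',  Y => 'CT',  S => 'CG',  W => 'AT',
    K => 'GT',  M => 'AC',  B => 'CGT', D => 'AGT',
    H => 'ACT', V => 'ACG', N => 'ACGT',
);

# Hypothetical helper: turn a motif such as 'ART' into a compiled regex.
sub motif_to_regex {
    my ($motif) = @_;
    my $pattern = '';
    for my $code (split //, uc $motif) {
        my $bases = $IUPAC{$code}
            or die "Unknown nucleotide code '$code' in motif '$motif'\n";
        # Plain bases stay literal. Degenerate codes become a character
        # class that also contains the code letter itself (assumption).
        $pattern .= length($bases) == 1 ? $bases : "[$bases$code]";
    }
    return qr/$pattern/;
}

# Usage: report the offset of every non-overlapping match.
my $motif_re = motif_to_regex('ART');
my $sequence = 'CCAATGGAGTAART';
while ($sequence =~ /$motif_re/g) {
    printf "match at offset %d: %s\n", $-[0], substr($sequence, $-[0], $+[0] - $-[0]);
}
```

Note that no '|' appears inside the character classes: within [...] a pipe is a literal character, not an alternation.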
What I am looking to do now is search for any user input string such that, in the 5/6 case of my initial example, any one character can match N while the rest of the string must match exactly. Additionally, I want this to be an optional feature rather than a given of every search, and I want the number of N positions to be adjustable.
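One way this could be sketched while staying with regexps is to build an alternation of every variant of the motif in which exactly k positions are replaced by the N class. The names combinations and relaxed_regex are hypothetical, and the sketch assumes a plain A/C/G/T motif; the number of variants grows quickly with k, so it is only practical for small k on motifs of 6-15 characters.

```perl
use strict;
use warnings;

# Hypothetical helper: every way to choose $k positions out of 0 .. $n-1,
# returned as a list of array references.
sub combinations {
    my ($n, $k, $start) = @_;
    $start = 0 unless defined $start;
    return ([]) if $k == 0;
    my @out;
    for my $i ($start .. $n - $k) {
        push @out, map { [ $i, @$_ ] } combinations($n, $k - 1, $i + 1);
    }
    return @out;
}

# Hypothetical helper: a regex matching $motif with exactly $k positions
# relaxed to [ACGTN]. With $k == 0 it matches the motif exactly, which
# makes the relaxed search easy to switch on and off.
sub relaxed_regex {
    my ($motif, $k) = @_;
    my @chars = split //, uc $motif;
    die "Cannot relax $k positions in a motif of length " . @chars . "\n"
        if $k > @chars;
    my @variants;
    for my $combo (combinations(scalar @chars, $k)) {
        my @unit = @chars;
        $unit[$_] = '[ACGTN]' for @$combo;   # wildcard the chosen positions
        push @variants, join '', @unit;
    }
    my $alternation = join '|', @variants;
    return qr/(?:$alternation)/;
}

# Usage: allow one relaxed position in a 7-character motif.
my $one_off = relaxed_regex('GATTACA', 1);
print "GATTTCA matches with one N allowed\n" if 'GATTTCA' =~ /^$one_off$/;
```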
I hope that clarifies things a bit...
Thanks
Matt
Thanks for the link, but from what I can tell, I have already independently coded everything found on that page....
Based on the other replies to my OP, it looks like the better approach for me would be to compute the Hamming distance between $string and every substring of the overall sequence that has the same length as $string.
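A minimal sketch of that sliding-window comparison follows. The name hamming_hits and its return format are assumptions for illustration, and the sketch compares characters literally, so degenerate codes in the motif would need separate handling.

```perl
use strict;
use warnings;

# Hypothetical helper: slide a window of length($motif) along $sequence
# and keep every window whose Hamming distance to $motif is at most
# $max_mismatch. Each hit is returned as [offset, window, distance].
sub hamming_hits {
    my ($motif, $sequence, $max_mismatch) = @_;
    my $len = length $motif;
    my @hits;
    for my $offset (0 .. length($sequence) - $len) {
        my $window = substr $sequence, $offset, $len;
        # XOR of two equal-length strings yields "\0" wherever the
        # characters agree, so counting the non-NUL bytes gives the
        # number of mismatching positions.
        my $distance = ($motif ^ $window) =~ tr/\0//c;
        push @hits, [ $offset, $window, $distance ]
            if $distance <= $max_mismatch;
    }
    return @hits;
}

# Usage: allow at most one mismatch.
for my $hit (hamming_hits('GATTACA', 'CCGATTTCAGG', 1)) {
    my ($offset, $window, $distance) = @$hit;
    print "offset $offset: $window ($distance mismatch(es))\n";
}
```

Setting $max_mismatch to 0 reduces this to an exact search, so the relaxed matching stays optional and the allowed number of mismatches is a single adjustable parameter.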
Whether or not I can implement that in the context of the program I've already devised, without a major overhaul, is the main question. So, in the meantime, I'll be reading up and playing around with it.
Thanks to all who answered though,
Matt