in reply to FASTA riddle.
Thanks for the more thorough description of your problem :-)
From your example, it looks like your approach is a bit backwards compared to what you seem to be aiming for: you pass arbitrary motifs on the command line and have the program search for and count them, whereas we suggest scanning the sequences once and counting all motifs in one go by looking at sub-segments (windows) along each sequence.
Have a look at Perl's substr function to extract a fixed number of characters (a motif/window) from the whole sequence. By varying the offset passed to substr you can walk along the sequence, with or without overlap between the windows.
As the Anonymous Monk said, you can then use that short string as a hash key. The first time you find a particular motif, you put it into the hash as a key and assign it the value 1. You then increment the value by 1 with each subsequent hit.
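To make the windowing idea concrete, here is a minimal sketch. The sequence, the window width and the step size are made-up values for illustration, not taken from your data; only substr and length are standard Perl.

```perl
use strict;
use warnings;

# Hypothetical example values (assumptions, not from the original question).
my $seq   = 'ATGCATGCA';
my $width = 3;    # length of each motif/window
my $step  = 1;    # 1 = overlapping windows; use $width for no overlap

my @windows;
# Stop as soon as a full-width window no longer fits in the sequence.
for (my $pos = 0; $pos + $width <= length $seq; $pos += $step) {
    push @windows, substr($seq, $pos, $width);
}

print join(' ', @windows), "\n";
```

Setting $step equal to $width gives you adjacent, non-overlapping windows instead; any trailing bases that do not fill a whole window are simply skipped in this sketch.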
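A rough sketch of the counting step might look like the following. The sub name count_motifs and the test sequence are invented for this example. Note that Perl's ++ on a hash value that does not exist yet creates it and sets it to 1, so the "assign 1 the first time, increment afterwards" logic collapses into a single statement.

```perl
use strict;
use warnings;

# count_motifs is a hypothetical helper name, not part of any module.
# It returns a hash reference mapping each window of $width characters
# to the number of times it occurs in $seq (overlapping windows).
sub count_motifs {
    my ($seq, $width) = @_;
    my %count;
    # The range is empty if the sequence is shorter than the window.
    for my $pos (0 .. length($seq) - $width) {
        $count{ substr($seq, $pos, $width) }++;
    }
    return \%count;
}

my $counts = count_motifs('ATGCATGCA', 3);
for my $motif (sort keys %$counts) {
    print "$motif\t$counts->{$motif}\n";
}
```

For a multi-record FASTA file you would call this once per sequence, or keep a single hash across all sequences, depending on whether you want per-sequence or overall statistics.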
Things get *much* more difficult if you want to accomplish fuzzy matching (ATA??AA), because then you need to cluster motifs based on similarity, specify cutoffs and use substitution models. That is beyond the scope of your original question. If you aim to do that, and already have a fairly good idea of which relaxed motifs you are searching for, then your original approach is easier because you can use regular expressions with wildcards to find matching motifs. As educated_foo said, you might be better off trying to find some program that does that already.
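If you do go the regular-expression route for a known relaxed motif, a sketch could look like this. It assumes that each ? in your pattern stands for exactly one arbitrary base (A, C, G or T), which is my reading of your notation and may not be what you intend. The sub name find_relaxed_motif and the test sequence are made up. The lookahead (?=...) is there so that overlapping hits are reported too.

```perl
use strict;
use warnings;

# find_relaxed_motif is a hypothetical helper name.
# Assumption: '?' in the pattern means "exactly one base out of A, C, G, T".
# Returns a list of [offset, matched_string] pairs, including overlapping hits.
sub find_relaxed_motif {
    my ($seq, $pattern) = @_;
    (my $re = $pattern) =~ s/\?/[ACGT]/g;    # ATA??AA -> ATA[ACGT][ACGT]AA
    my @hits;
    # The zero-width lookahead lets the next search start one base further on,
    # so matches that overlap a previous hit are not skipped.
    while ($seq =~ /(?=($re))/g) {
        push @hits, [ $-[1], $1 ];           # $-[1] is the start offset of group 1
    }
    return @hits;
}

my @hits = find_relaxed_motif('ATACGAATATTAA', 'ATA??AA');
print "$_->[0]\t$_->[1]\n" for @hits;
```

This only covers fixed-position wildcards. Anything involving mismatches, gaps or substitution scores is a different problem, as said above.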
I can wholeheartedly recommend BioPerl over at http://www.bioperl.org/ for bioinformatics programming of this kind as it abstracts away some of the repetitive low-level tasks (such as parsing a FASTA file) and speeds up development. You need to know or be willing to learn object-oriented programming to benefit from it.
Best of luck! Keep us posted about your progress and do not hesitate to ask further questions.
Re^2: FASTA riddle.
by MaroonBalloon (Acolyte) on Dec 12, 2009 at 22:59 UTC
by MaroonBalloon (Acolyte) on Dec 12, 2009 at 23:12 UTC