Re^4: genetic algorithm for motif finding

Does this help a bit?

Some. Though it is still couched in a lot of terms that aren't immediately descriptive to the layman.

Is this close to a paraphrase of the problem?

A motif is a subsequence, of initially unknown length, that repeats, with minor variations, several times within a localised region of a (gene) sequence.
The problem of finding them is that of recognising that there are several near repetitions of a subsequence within a (relatively) short stretch (100s or low thousands) of 'letters'.

If that is close, then a few questions arise:

Is the 'source material' for the search coded in terms of just {acgt}?
Or is it encoded in that other form where one letter is used to specify: this position might be any of 'a' or 'c'; or this position might be any of 'a' or 'g' or 't'?
Is there a minimum length to a motif?
Will the repetitions be the same length?
Will the seeker normally have a rough location from (or around) which to start looking?
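
For concreteness, the 'other form' asked about above looks like the IUPAC nucleotide ambiguity codes, where for example M stands for 'a' or 'c' and D stands for 'a' or 'g' or 't'. Here is a minimal sketch, in Python, of how such a pattern could be searched for in plain {acgt} text by turning it into a regular expression. The function names iupac_to_regex and find_iupac are made up for illustration; this is not how any particular motif finder does it.

```python
import re

# Standard IUPAC nucleotide ambiguity codes (DNA alphabet).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def iupac_to_regex(pattern):
    """Translate an IUPAC pattern into a regex source string (hypothetical helper)."""
    parts = []
    for ch in pattern.upper():
        bases = IUPAC[ch]  # raises KeyError on a non-IUPAC letter
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

def find_iupac(pattern, sequence):
    """Return (offset, matched_text) for every match, overlapping ones included."""
    # A lookahead with a capture group lets the regex engine report
    # matches that overlap one another.
    rx = re.compile("(?=(" + iupac_to_regex(pattern) + "))", re.IGNORECASE)
    return [(m.start(), m.group(1)) for m in rx.finditer(sequence)]

if __name__ == "__main__":
    print(iupac_to_regex("MD"))        # [AC][AGT]
    print(find_iupac("MD", "acgtca"))  # [(1, 'cg'), (4, 'ca')]
```

This only handles exact matches against the ambiguity classes; it says nothing about mismatches, gaps, or how the pattern was arrived at in the first place.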

Maybe I've nothing to contribute to the problem; but I was playing with a novel indexing algorithm a couple of years ago that might lend itself to this. My difficulty is getting a clear understanding of the problem in terms I can relate to, without having to go off and become conversant in the genomic terminology.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

Re^5: genetic algorithm for motif finding by bioinformatics (Friar) on Aug 14, 2013 at 02:25 UTC
Sorry about the terminology, it's hard to break from it when you've used it for a while.

The motif is a subsequence, and it may be found several times in a localized region. More often than not, you'll find one or two instances near one another, with other instances very far away (including on other chromosomes). So, you are typically comparing a set of strings from various parts of the genome together. It would bear more similarity to multiple sequence alignment in that sense.

The source material is {ATCG} only; the format you are mentioning is what the OP was requesting as output, also known as IUPAC format. Some people encode the motif in terms of bits of information to produce a motif logo, letting you know the conservation at each position.

Motifs typically range from 4 to 20 or so bases (characters) in length, with some positions in the motif substring being conserved more often than others (i.e., if the base at the third position of the motif isn't an A, the protein doesn't bind). The repetitions will be the same or similar length, yes.

As for where to start looking, regions of high evolutionary conservation and protein binding sites (via ChIP-Sequencing data) would be common ways to narrow down the regions to look in.

As an aside, not all repetitive sequence is informative in the same way. There is plenty of repetitive sequence in the genome that has functions outside of protein binding (and the term repetitive sequence has a different meaning than what you might ascribe to it). Tools like RepeatMasker are used to identify these regions, and databases of these sequences exist that you could use to determine whether an enriched sequence is informative. Simple repeats of ATATATATTATTATATATATATAT aren't as likely to be a protein binding site, for instance.

Bioinformatics
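
To make the IUPAC output and the "bits of information" idea a little more concrete, here is a rough Python sketch. It assumes you already have a handful of motif instances that are the same length and already aligned, which is the hard part and is not addressed here. The instance strings and the function names (column_counts, iupac_consensus, information_bits) are invented for illustration. The bits calculation is the plain 2 minus Shannon entropy per column, without the small-sample correction that logo tools usually apply.

```python
import math
from collections import Counter

# Map from the sorted set of bases seen in a column to its IUPAC code.
CODE = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W",
    "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V",
    "ACGT": "N",
}

def column_counts(instances):
    """Count the bases seen at each position of equal-length, aligned instances."""
    length = len(instances[0])
    if any(len(s) != length for s in instances):
        raise ValueError("all instances must have the same length")
    return [Counter(s[i].upper() for s in instances) for i in range(length)]

def iupac_consensus(instances):
    """One IUPAC letter per column, covering every base observed there."""
    return "".join(CODE["".join(sorted(c))] for c in column_counts(instances))

def information_bits(instances):
    """Per-column information content: 2 bits minus the column's entropy."""
    bits = []
    for counts in column_counts(instances):
        total = sum(counts.values())
        entropy = -sum((n / total) * math.log2(n / total)
                       for n in counts.values())
        bits.append(2.0 - entropy)
    return bits

if __name__ == "__main__":
    sites = ["TGACA", "TGACG", "TGATA", "TCACA"]  # made-up instances
    print(iupac_consensus(sites))  # TSAYR
    print([round(b, 2) for b in information_bits(sites)])
```

A fully conserved column scores 2 bits and a column where all four bases are equally likely scores 0, which is what the letter heights in a motif logo reflect. Note that this consensus marks a base as allowed even if it shows up only once, so a real tool would likely apply a frequency threshold.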
Re^6: genetic algorithm for motif finding by BrowserUk (Patriarch) on Aug 14, 2013 at 09:34 UTC
Okay. Thanks for taking the time to clarify things for me.

It strikes me that with the possibility of motifs as short as 4bp, combined with fuzziness, there must literally be billions of places where repeats occur within a few dozen bp. Without some additional information about either what to look for, or where to look, or both, this is just a brute-force problem without the possibility of clever or interesting solutions.

I still have a copy of the human genome I downloaded, which I indexed the wazoo out of when trying out my novel indexer, and I still have the indexes (33GB of highly compressed dbs); but without some criteria upon which to start looking, the best I could do is produce another huge file of possibles that would serve no good purpose.
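
A back-of-envelope way to put numbers on that worry, sketched in Python. It assumes a random sequence with the four bases independent and equally likely, which a real genome is not, and a round 3 billion bp for the genome length. The function names are made up for illustration.

```python
from math import comb

def neighbourhood_size(k, max_mismatches):
    """Number of distinct k-mers within the given Hamming distance of one k-mer."""
    # Choose which d positions differ, then one of 3 other bases at each.
    return sum(comb(k, d) * 3 ** d for d in range(max_mismatches + 1))

def expected_hits(genome_length, k, max_mismatches=0):
    """Expected number of positions matching one k-mer, allowing mismatches.

    Assumes uniform, independent bases; real genomes deviate from this.
    """
    positions = genome_length - k + 1
    return positions * neighbourhood_size(k, max_mismatches) / 4 ** k

if __name__ == "__main__":
    genome = 3_000_000_000  # round figure, assumed for illustration
    for k in (4, 8, 12, 20):
        print(k, f"{expected_hits(genome, k, 1):.3g}")
```

Under those assumptions a single 4-mer with one mismatch allowed is expected at roughly 150 million positions by chance alone, while the expected count for a 20-mer with one mismatch drops well below one. That is consistent with the point that short, fuzzy motifs need outside information about where to look.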