in reply to Re^2: genetic algorithm for motif finding
in thread genetic algorithm for motif finding

A motif, also termed a consensus sequence, is a stretch of DNA that occurs in many places across the genome, either exactly (rare) or in a similar form; you can think of it like the input to a fuzzy search: close but not exact, maybe a misspelling or two. These sites often serve as places where proteins physically interact with and bind to the DNA. There are data collections, based on sequencing data, that tell you where along the DNA a given protein binds, and you can look for enriched or common motifs in the sequence at those binding coordinates. It may be that these motifs are associated with the protein in question, or they may be motifs for other proteins that interact with the protein you have data for. Collections of motifs in a region (such as the promoter region, where many proteins bind to turn a gene on or off, or to turn its expression up or down; think dimmer switch) can be referred to as cis-regulatory regions. You know you have found one when you see an enrichment, i.e. an increased frequency, over some background (control) sequence. Common programs used in this analysis are MEME and NestedMICA. Does this help a bit?
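
To make the "fuzzy search" and "enrichment over background" ideas concrete, here is a minimal Python sketch. It is not how MEME or NestedMICA work; it only counts places where a known candidate motif matches with at most a few mismatches, and compares hit density between two sets of sequences. The function names and the toy sequences are made up for illustration.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def fuzzy_hits(seq, motif, max_mismatches=1):
    """Start offsets where motif matches seq with at most max_mismatches."""
    m = len(motif)
    return [i for i in range(len(seq) - m + 1)
            if hamming(seq[i:i + m], motif) <= max_mismatches]


def hits_per_kb(seqs, motif, max_mismatches=1):
    """Fuzzy hits per 1000 bases, pooled over a collection of sequences."""
    total_hits = sum(len(fuzzy_hits(s, motif, max_mismatches)) for s in seqs)
    total_len = sum(len(s) for s in seqs)
    return 1000.0 * total_hits / total_len if total_len else 0.0


# Toy data, invented for this example (not real binding sites).
motif = "TGACGTCA"
bound = ["TTGACGTCATT", "CCTGACGTAAGG"]       # stand-in for protein-bound regions
background = ["TTTTTTTTTTT", "CCCCCCCCCCCC"]  # stand-in for control sequence

if __name__ == "__main__":
    print("bound:     ", hits_per_kb(bound, motif))
    print("background:", hits_per_kb(background, motif))
```

A higher hit density in the bound set than in the background is the kind of enrichment described above; a real analysis would also need a significance test and a realistic background model, which this sketch leaves out.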

Bioinformatics

Re^4: genetic algorithm for motif finding
by BrowserUk (Patriarch) on Aug 13, 2013 at 23:28 UTC
    Does this help a bit?

    Some. Though it is still couched in a lot of terms that aren't immediately descriptive to the layman.

    Is this close to a paraphrase of the problem?

    A motif is a subsequence, of initially unknown length, that repeats, with minor variations, several times within a localised region of a (gene) sequence.

    The problem of finding them is that of recognising that there are several near repetitions of a subsequence within a (relatively) short stretch (100s or low thousands) of 'letters'.

    If that is close, then a few questions arise:

    • Is the 'source material' for the search coded in terms of just {acgt}?

      Or is it encoded in that other form where one letter is used to specify: this position might be either 'a' or 'c'; or this position might be any of 'a', 'g' or 't'?

    • Is there a minimum length to a motif?
    • Will the repetitions be the same length?
    • Will the seeker normally have a rough location from (or around) which to start looking?

    Maybe I've nothing to contribute to the problem, but I was playing with a novel indexing algorithm a couple of years ago that might lend itself to it. My difficulty is getting a clear understanding of the problem in terms I can relate to, without having to go off and become conversant in the genomic terminology.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Sorry about the terminology, it's hard to break from it when you've used it for a while. The motif is a subsequence, and it may be found several times in a localized region. More often than not, you'll find one or two instances near one another, with other instances very far away (including on other chromosomes). So, you are typically comparing a set of strings from various parts of the genome with one another. It bears more similarity to multiple sequence alignment in that sense.

      The source material is {ATCG} only; the format you are describing is what the OP was requesting as output, and is known as IUPAC format. Some people encode the motif in terms of bits of information to produce a motif logo, which shows the conservation at each position. Motifs typically range from 4 to 20 or so bases (characters) in length, with some positions in the motif being conserved more strongly than others (e.g., if the base at the third position of the motif isn't an A, the protein doesn't bind). The repetitions will be the same or similar length, yes. As for where to start looking, regions of high evolutionary conservation and protein binding sites (via ChIP-sequencing data) are common ways to narrow down the regions to search.
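
      For what it's worth, here is a rough Python sketch of the two encodings mentioned above, assuming you already have a set of aligned, equal-length sites in upper-case ACGT. The IUPAC table is the standard one; the function names, the min_fraction cutoff and the example sites are my own inventions, and the information content is the plain 2 minus entropy figure with no small-sample correction.

```python
import math
from collections import Counter

# Standard IUPAC nucleotide codes, keyed by the set of bases each stands for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def column_counts(sites):
    """Count bases at each position of equal-length, already-aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("sites must all have the same length")
    return [Counter(s[i] for s in sites) for i in range(length)]


def iupac_consensus(sites, min_fraction=0.25):
    """One IUPAC letter per column; min_fraction is an arbitrary cutoff."""
    n = len(sites)
    letters = []
    for counts in column_counts(sites):
        kept = frozenset(b for b, c in counts.items() if c / n >= min_fraction)
        if not kept:  # cutoff too strict: fall back to the most common base
            kept = frozenset(counts.most_common(1)[0][0])
        letters.append(IUPAC[kept])
    return "".join(letters)


def information_bits(sites):
    """Per-column information content in bits: 2 minus the Shannon entropy."""
    n = len(sites)
    bits = []
    for counts in column_counts(sites):
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        bits.append(2.0 - entropy)
    return bits


# Invented example sites, not taken from any real dataset.
sites = ["TGACGTCA", "TGACGTAA", "TGATGTCA", "TGACGTCA"]

if __name__ == "__main__":
    print(iupac_consensus(sites))
    print([round(b, 2) for b in information_bits(sites)])
```

      Fully conserved columns come out at 2 bits and mixed columns lower, which is what the letter heights in a motif logo represent.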

      As an aside, not all repetitive sequence is informative in the same way. There is plenty of repetitive sequence in the genome that has functions outside of protein binding (and the term "repetitive sequence" has a different meaning from the one you might ascribe to it). Tools like RepeatMasker are used to identify these regions, and databases of these sequences exist that you could use to determine whether an enriched sequence is informative. Simple repeats such as ATATATATTATTATATATATATAT aren't as likely to be a protein binding site, for instance.

      Bioinformatics

        Okay. Thanks for taking the time to clarify things for me.

        It strikes me that with the possibility of motifs as short as 4bp, combined with fuzziness, there must be literally billions of places where repeats occur within a few dozen bp. Without some additional information about what to look for, where to look, or both, this is just a brute force problem without the possibility for clever or interesting solutions.
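
        A back-of-envelope way to put numbers on that, as a small Python sketch: it assumes every base is independent and equally likely (which real genomes are not), takes a round 3 billion positions for the genome, and looks at one strand only. The function names are mine.

```python
from math import comb


def match_probability(length, max_mismatches):
    """Chance that a random position matches a fixed motif of the given
    length with at most max_mismatches, assuming uniform independent bases."""
    # Each choice of j mismatch positions admits 3**j wrong-base spellings.
    ways = sum(comb(length, j) * 3 ** j for j in range(max_mismatches + 1))
    return ways / 4 ** length


def expected_hits(genome_length, length, max_mismatches):
    """Expected number of chance matches on one strand of a random sequence."""
    positions = max(genome_length - length + 1, 0)
    return positions * match_probability(length, max_mismatches)


GENOME = 3_000_000_000  # round figure for illustration only

if __name__ == "__main__":
    for length in (4, 8, 12, 20):
        print(length, expected_hits(GENOME, length, 1))
```

        Under those assumptions a single 4-base motif with one mismatch allowed matches about 5% of all positions (13 spellings out of 256), so on the order of 10^8 chance hits, whereas a 20-base motif with one mismatch is expected less than once. So the short, fuzzy end of the range is where extra information about where to look matters most.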

        I still have a copy of the human genome I downloaded, which I indexed the wazoo out of when trying out my novel indexer, and I still have the indexes (33GB of highly compressed dbs); but without some criteria upon which to start looking, the best I could do is produce another huge file of possibles that would serve no good purpose.

