in reply to Re^2: Filtering matches of near-perfect-matched DNA sequence pairs
in thread Filtering matches of near-perfect-matched DNA sequence pairs

Thank you. I still don't understand, though. What is the input? Are the dashes already there?
لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

Re^4: Filtering matches of near-perfect-matched DNA sequence pairs
by Anonymous Monk on Mar 14, 2015 at 00:31 UTC

    And if the dashes are not there, are both sequences still 10 characters long? That is unlike what you showed...

Re^4: Filtering matches of near-perfect-matched DNA sequence pairs
by onlyIDleft (Scribe) on Mar 15, 2015 at 02:00 UTC

    Nope, the dashes are there to help the reader see where there is an insertion/deletion (indel) event rather than a substitution of a letter.

    A gap caused by indel(s), i.e. the absence of an aligned letter, is commonly denoted by biologists with the '-' symbol in sequence alignments.

    You may replace it in your mind with just a blank space if that helps you. Hope that clarifies it.

      Then what actual input caused the strings with the dashes in them? Are all strings with - in them a result instead of an input?

      This just makes it more unclear exactly what the inputs to your problem are. You are showing sequences with dashes and sequences with only nine characters in them, in contradiction to your original problem statement. (two 10 character strings)

      Please make up a test file with the *real* input sequences in it, and also show the expected output for each input.

        I suspect, based on previous similar questions of this type, that what the OP has is:

        1. A bunch of long sequences; possibly 1000s or 100,000s of bytes/codons/other units long.
        2. A bunch of shorter sequences perhaps 10 chars, perhaps 9 or 10 chars, long.

        And the process he's trying to code is:

        • For each of the long sequences...
        • For each of the short sequences...
        • Scan the longer sequence looking for sites where the shorter sequence 'matches' and record those positions.

        The complication is that the matching is 'fuzzy' within his set of constraints:

        • A 9-character subsection of the longer sequence may be considered a match to a 10-character short sequence if (for example) removing exactly one character from the 10-character sequence makes it match the 9-character subsection.
        • Or, a 10-character short sequence may be considered a match for a 10-character subsection of the longer sequence if they differ in exactly (only) one position.
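
        To make the guess concrete, here is a minimal sketch of those two rules and the scanning loop. It is written in Python purely for illustration; the names fuzzy_match and find_sites are made up here, and the rules themselves are my reading of the problem, not anything the OP has confirmed.

```python
def fuzzy_match(short, window):
    """Return True if `window` matches `short` under the two supposed rules.

    Assumption (not confirmed by the OP): a match means either exactly one
    substitution (equal lengths) or exactly one deleted character
    (window is one character shorter than the short sequence).
    """
    if len(window) == len(short):
        # Rule 2: same length, differing in exactly one position.
        return sum(a != b for a, b in zip(short, window)) == 1
    if len(window) == len(short) - 1:
        # Rule 1: removing exactly one character from `short` gives `window`.
        return any(short[:i] + short[i + 1:] == window
                   for i in range(len(short)))
    return False


def find_sites(long_seq, short):
    """Scan `long_seq` and return sorted (position, width) pairs that match."""
    hits = []
    n = len(short)
    for width in (n, n - 1):
        # Slide a window of this width along the long sequence.
        for pos in range(len(long_seq) - width + 1):
            if fuzzy_match(short, long_seq[pos:pos + width]):
                hits.append((pos, width))
    return sorted(hits)


if __name__ == "__main__":
    short = "ACGTACGTAC"
    # Hypothetical long sequence containing one substituted copy of `short`.
    long_seq = "TTTT" + "ACGTTCGTAC" + "GGGG"
    print(find_sites(long_seq, short))
```

        Note that an exact copy of the short sequence is deliberately not reported, because the rule as stated requires exactly one difference. Overlapping windows of both widths are checked independently, so one site in the long sequence can be reported more than once.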

        But, that's just supposition until he answers somebody's questions!

