I don't need or want anything proprietary! (But accuracy would help!)

If you have recently run a fuzzy search for short sequences (primers?) of fewer than 32 bases against a publicly available long sequence (~1GB or bigger), and you have the information to hand to answer the following questions, I would greatly appreciate your answers.

  1. How long was the big sequence?

    (And preferably -- though not absolutely necessary -- where can I download a copy?)

  2. How many short sequences were there, and what were their length(s)?

    Figures like "approx. 200 of around 25 bases" are better than nothing.

    "205, average length 19, ranging from 14 to 25" is better.

    A list of exact lengths is better yet.

    (Best of all would be a file of the actual sequences used; but I realise that might be verboten.)

  3. How fuzzy?

    I.e. what Hamming distance (number of mismatched bases) was acceptable for a match?

    If your run used more complex rules (e.g. fewer than 3 insertions or deletions and up to 5 transpositions), those details would help.

    Also, if you used one of the BLASTx programs with a minimum "word length", details of that setting would be important.

  4. How long did the run take?

    Here I really need more than just elapsed (wall clock) time.

    Perfection would be the number of clock cycles or CPU seconds, which would be more useful still if details of the CPU(s) used were available.

  5. How many match sites were discovered?

    Just the overall number of match sites would suffice.

    Match sites per short sequence would be ideal, assuming that I can have the input sequences as well.

  6. What hardware was the run performed on?

    In some ways this is the most important criterion. CPU type(s), number of cores of each type, and clock speeds would be best.
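
To make question 3 concrete, here is a minimal sketch (in Python, purely for illustration) of the most naive Hamming-distance search: slide the short sequence along the long one and report every offset where the number of mismatched bases is within the allowed limit. The function names and the toy sequences are invented for this example. It is only a reference point for what "a match within k mismatches" means; it handles substitutions only (no inserts or deletes), and it is not a description of how any particular tool works.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def fuzzy_find(needle, haystack, max_mismatches):
    """Return (offset, distance) for every window within max_mismatches."""
    n = len(needle)
    hits = []
    for offset in range(len(haystack) - n + 1):
        d = hamming(needle, haystack[offset:offset + n])
        if d <= max_mismatches:
            hits.append((offset, d))
    return hits


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    genome = "ACGTACGTTAGCACGAACGT"
    primer = "ACGT"
    for offset, dist in fuzzy_find(primer, genome, 1):
        print(offset, dist)
```

The brute-force loop costs roughly (long length) x (short length) comparisons per short sequence, which is why the answers to questions 1 to 3 matter so much when comparing timings.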

The reason:

I think I've found a better (more accurate and much faster) way to do such fuzzy searches. But before expending lots of effort on putting together a proper package for CPAN -- this is a pure, for-fun, home project; not work -- I'd really like to make some detailed comparisons with the current state of the art to convince myself that it a) works; and b) is sufficiently faster to warrant the effort.

Basically, I want to run my crude prototype code against a few real (or at least realistic) test cases with known results and timings to see how it stands up before taking it any further.
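
As a rough illustration of the kind of figures asked for in question 4, here is a small sketch (again in Python, as an assumption; any language with both clocks would do) that records elapsed wall-clock time and CPU seconds for a single run. `time_run` and `count_occurrences` are hypothetical names made up for the example, and the workload is a trivial exact-match count standing in for a real search.

```python
import time


def time_run(func, *args, **kwargs):
    """Run func once; return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # elapsed (wall clock) time
    cpu_start = time.process_time()    # user + system CPU time of this process
    result = func(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds


def count_occurrences(needle, haystack):
    """Stand-in workload: count exact, possibly overlapping, occurrences."""
    count = 0
    pos = haystack.find(needle)
    while pos != -1:
        count += 1
        pos = haystack.find(needle, pos + 1)
    return count


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    sites, wall, cpu = time_run(count_occurrences, "ACGT", "ACGTACGTTAGCACGAACGT")
    print(f"match sites: {sites}  wall: {wall:.6f}s  cpu: {cpu:.6f}s")
```

CPU seconds exclude time spent waiting on disk or other processes, so reporting them alongside the match-site count and the CPU details makes runs on different machines much easier to compare than wall-clock time alone.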

Thanks for any help you can provide.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

In reply to RFC: A call to bioinformationalists for some generic information. by BrowserUk
