in reply to Re^7: What is the best approach to check if a position falls within a target range?
in thread What is the best approach to check if a position falls within a target range?

Hi BrowserUk,
The target region is a chromosome BED file in which each region spans, on average, about 300 bases on any chromosome. Let us say the target region consists of successive regions of about 300 bases each. I am trying to identify whether a SNP falls within a target region. 2**16 or 2**32 as a maximum is really too big!
target region
=============
chr1 100 400
chr1 450 750
chr1 780 980
...
Thanks Much,
Uma

Re^9: What is the best approach to check if a position falls within a target range?
by BrowserUk (Patriarch) on Feb 12, 2011 at 15:25 UTC

    Hm. A file containing 2 million queries, each consisting of a single integer in the range 1 .. 1000, is going to contain (on average) 2000 duplicates of each query?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Hi BrowserUk,
      I am sorry I don't understand your question!
      I am trying to output something like this:
      Query
      =====
      89
      100
      200
      416
      500
      780

      Target
      ======
      100_400
      420_720
      800_1100

      Output
      =======
      89 not_in_target
      100 100_400
      200 100_400
      416 not_in_target
      500 420_720
      780 not_in_target
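One way the lookup above could be sketched, in Python for brevity, is a binary search over the sorted region starts. This is only an illustration under stated assumptions: the regions do not overlap, both endpoints are treated as inclusive (the sample only shows that the start is), and the names `build_index` and `lookup` are made up for this example rather than taken from anything in the thread.

```python
from bisect import bisect_right

def build_index(targets):
    """Sort (start, end) pairs; return the start list plus the sorted pairs."""
    ordered = sorted(targets)
    starts = [start for start, _ in ordered]
    return starts, ordered

def lookup(pos, starts, ordered):
    """Return 'start_end' for the region containing pos, else 'not_in_target'.

    Assumes non-overlapping regions and inclusive endpoints.
    """
    # Index of the last region whose start is <= pos (or -1 if none).
    i = bisect_right(starts, pos) - 1
    if i >= 0:
        start, end = ordered[i]
        if start <= pos <= end:
            return f"{start}_{end}"
    return "not_in_target"

targets = [(100, 400), (420, 720), (800, 1100)]
queries = [89, 100, 200, 416, 500, 780]

starts, ordered = build_index(targets)
results = [(q, lookup(q, starts, ordered)) for q in queries]
for q, label in results:
    print(q, label)
```

Each query costs one binary search, so the work grows with the logarithm of the number of regions. If the end coordinate is meant to be exclusive, the comparison would become `start <= pos < end`.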
        I am sorry I don't understand your question!

        You said that the query file contains 2 million queries.

        You said that the maximum number is much less than 2**16, and in your sample data all the numbers are less than 1000.

        If all the numbers are less than 1000 and you have 2 million of them, then on average, each query will be duplicated 2000 times.

        Which suggests that if you pre-processed your queries file to remove the duplication, you would end up with only 1000 queries and so would reduce the amount of work to do by 3 orders of magnitude. I.e. your processing would be 2000 times faster.
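The de-duplication idea could look roughly like the following sketch (Python, for illustration only). It assumes each query really is a single integer; `classify_unique` and the toy `in_100_400` classifier are hypothetical names invented for this example.

```python
def classify_unique(queries, classify):
    """Call classify once per distinct query, then fan the answers back out."""
    cache = {}
    for q in queries:
        if q not in cache:
            cache[q] = classify(q)  # the expensive lookup runs only here
    return [(q, cache[q]) for q in queries], len(cache)

def in_100_400(pos):
    """Toy classifier: a single inclusive region 100..400 (assumption)."""
    return "100_400" if 100 <= pos <= 400 else "not_in_target"

queries = [200, 89, 200, 500, 89, 200]
labelled, distinct = classify_unique(queries, in_100_400)
print(distinct, "distinct positions out of", len(queries), "queries")
```

The output still has one line per original query, but the classifier runs only once per distinct position, which is where the saving would come from if the queries really do repeat that heavily.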

        All of which suggests either your queries are more than just a single integer, or the range is much larger. Which is it?

