in reply to Re: What is the best approach to check if a position falls within a target range?
in thread What is the best approach to check if a position falls within a target range?

Hi Monks,
I would very much appreciate any suggestions on optimizing this algorithm. Typically the query file contains some 2 million records and the target region has about 200,000 records; at that scale it takes several hours to assign the target status, even with the quicksort algorithm. I am thinking of first dividing the target array into pivot regions at 12.5, 25, 50, 75, 87.5 and 100, then checking whether a query SNP lies in one of these bigger chunks before passing the subset array into the quicksort function. It appears that the recursion in the function is a big drag!
Is there a quicker way of achieving this?
Thanks Much,
Uma
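
For what it's worth, one way to avoid the quicksort recursion entirely is to sort the target regions once and binary-search each query position. This is only a sketch under assumptions (regions non-overlapping, one chromosome, sample values from later in this thread; `find_region` is a made-up name): sort by start, find the last region whose start is at or below the position, and a single end-check decides membership. That is O(N log N) setup plus O(log N) per query, with no deep recursion.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Example regions as [start, end] pairs, sorted once by start.
my @regions = sort { $a->[0] <=> $b->[0] }
              ( [10, 50], [60, 150], [180, 250] );

# Iterative binary search: last region with start <= $pos,
# then check that $pos is not past that region's end.
sub find_region {
    my ($pos) = @_;
    my ($lo, $hi) = (0, $#regions);
    my $best = -1;
    while ($lo <= $hi) {                  # iterative, no recursion
        my $mid = int( ($lo + $hi) / 2 );
        if ($regions[$mid][0] <= $pos) {
            $best = $mid;                 # candidate; keep looking right
            $lo   = $mid + 1;
        }
        else {
            $hi = $mid - 1;
        }
    }
    return undef if $best < 0;
    my ($s, $e) = @{ $regions[$best] };
    return ($pos <= $e) ? "${s}_${e}" : undef;
}

for my $snp (100, 200, 300) {
    my $hit = find_region($snp);
    print defined $hit ? "$snp $hit\n" : "$snp -\n";
}
```

With the sample data above this prints `100 60_150` and `200 180_250`, and reports no hit for 300.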

Replies are listed 'Best First'.
Re^3: What is the best approach to check if a position falls within a target range?
by BrowserUk (Patriarch) on Feb 12, 2011 at 14:11 UTC
    the query file contains some 2 million records and the target region has about 200,000 records.

    Could you post 3 query file records and 3 target region records?

      Hi BrowserUk,
      query
      =====
      100
      200
      300

      target_region
      =============
      10_50
      60_150
      180_250

      expected output
      ===============
      100 60_150
      200 180_250
      Thanks Much, Uma

        What are the min and max of these numbers? What are the typical and maximum spans of the regions?


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re^3: What is the best approach to check if a position falls within a target range?
by BrowserUk (Patriarch) on Feb 12, 2011 at 21:02 UTC

    FWIW. I've a solution that matches 2e6 random integers (0 .. 1000) against 200,000 randomly generated ranges (0..700, 1..300) in 45 minutes using 3GB of RAM.

    Of course, of those 2e6 queries only 1000 are unique, so it's doing 2000 times more work than it needs to.
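
    The duplicate-query observation above suggests an easy win: cache each position's answer the first time it is computed and reuse it thereafter, so each of the ~1000 distinct positions is resolved exactly once. A minimal sketch (the names `lookup_cached` and the stand-in `$find` sub are hypothetical; plug in your real range-lookup routine):

    ```perl
    use strict;
    use warnings;

    my %cache;    # position => region label, or undef for "no hit"

    # Memoized wrapper: $find is the real (expensive) lookup sub.
    sub lookup_cached {
        my ($pos, $find) = @_;
        $cache{$pos} = $find->($pos) unless exists $cache{$pos};
        return $cache{$pos};
    }

    # Hypothetical stand-in lookup, for illustration only.
    my $find = sub { my $p = shift; $p >= 60 && $p <= 150 ? '60_150' : undef };

    for my $snp (100, 100, 300, 100) {    # duplicates hit the cache
        my $hit = lookup_cached($snp, $find);
        print defined $hit ? "$snp $hit\n" : "$snp -\n";
    }
    ```

    Note the `exists` test rather than a definedness test, so that "no hit" (undef) is cached too instead of being recomputed every time.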


      Hi BrowserUk,
      Thanks much for all your replies. I did some major refactoring of the code these past few days and was able to achieve significant improvement in speed.
      Major changes are:
      1. Split the target region into 24 chunks, one per chromosome (chr), and only loaded the chr of interest into memory.
      2. Converted the target hash to a target array. This caused a big gain in speed.
      UPDATE
      3. Divided the target region into 8 chunks (10, 12.5, 25, 50, and so on) and checked which chunk a query SNP falls into before assigning target status. Uma
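
      The chunking in steps 1–3 is essentially a bucket index, and it generalises to fixed-width buckets. A sketch under assumptions (bucket width, region values and sub names are all made up for illustration): register each region in every bucket it overlaps, then a query position only scans the handful of regions in its own bucket instead of the whole target array.

      ```perl
      use strict;
      use warnings;

      my $WIDTH   = 50;                              # bucket width (tunable)
      my @regions = ( [10, 50], [60, 150], [180, 250] );

      # Build the index once: each region goes into every bucket it spans.
      my %bucket;                                    # bucket number => regions
      for my $r (@regions) {
          my ($s, $e) = @$r;
          push @{ $bucket{$_} }, $r
              for int($s / $WIDTH) .. int($e / $WIDTH);
      }

      # A query only scans the regions registered in its own bucket.
      sub in_target {
          my ($pos) = @_;
          for my $r ( @{ $bucket{ int($pos / $WIDTH) } || [] } ) {
              return "$r->[0]_$r->[1]" if $pos >= $r->[0] && $pos <= $r->[1];
          }
          return undef;
      }

      print in_target(100) // '-', "\n";
      print in_target(300) // '-', "\n";
      ```

      The width trades memory for speed: smaller buckets mean shorter scans per query but more index entries per region.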

        How long is your current processing taking?