umasuresh has asked for the wisdom of the Perl Monks concerning the following question:

I am interested in checking if a SNP/position lies within a target region. These are the steps that I have in my script.
  1. Load target regions into a Hash of Array (chrN -> [coord1,coord2,coord3,...])
  2. Load SNP lines into an array
  3. Foreach SNP, if chrN key exists in the HOA, check if the coordN falls within a certain target region for each chromosome
  4. If yes, print the SNP line, else skip it
My script prints what I need, but is incredibly slow if there are say 10,000 lines or more in the input file.
Is there an quicker way to achieve the same?
I greatly appreciate any input in this regard!
  • Comment on What is the best approach to check if a position falls within a target range?
  • Download Code

Replies are listed 'Best First'.
Re: What is the best approach to check if a position falls within a target range?
by Illuminatus (Curate) on Nov 10, 2010 at 19:51 UTC
    Sorry, but I'm not very familiar with SNP. Are the values represented by 'coord1', etc fully unique, or do they repeat per chromosome (or something else)? If the coord values are unique, you might be able to trade space for time. You could create a second target region hash, key'ed by the coord's, and have the chr for each coord as its value. Then, instead of walking an array every time, you would just check if chr was in your current HOA, and the coord was in the new hash. This should be faster, if most arrays are of non-trivial length

    fnord

Re: What is the best approach to check if a position falls within a target range?
by Anonymous Monk on Nov 10, 2010 at 19:40 UTC
    If step #2 is doing what it sounds like you're doing, you're making a (shallow) copy of the array, which will double the amount of memory used for that data. Instead, work with the array ref directly. This may apply for step #3 too, depending on how that's implemented.

    For example, instead of
    my @data = @{ $hash{chrN} }; for my $SNP (@data) { ... }
    do
    for my $SNP (@{ $hash{chrN} }) { ... }
    or
    my $data_ref = $hash{chrN}; for my $SNP (@$data_ref) { ... }

Re: What is the best approach to check if a position falls within a target range?
by umasuresh (Hermit) on Nov 12, 2010 at 18:05 UTC
    Thanks Illuminatus and Anonymous monks for your valuable suggestions. I have been researching on sorting algorithms and got inspired to try something based on the Quicksort Algorithm explained in Mastering Algorithms with Perl. Here is what I tried which I hope will significantly reduce the target assignment. I have tried this on a test array and yet to check its performance on a huge SNP file. But here is what I have so far ... Any feedback is greatly appreciated! UPDATE: End points with just two elements in an array will lead to out of memory errors!!! I am fixing this issue. Will post soon...
Re: What is the best approach to check if a position falls within a target range?
by umasuresh (Hermit) on Nov 15, 2010 at 22:16 UTC
    Hi Monks, I am posting an updated version of the code which significantly reduced the time for assigning target status (less than 1/2 the time compared to the original version). Any input is greatly appreciated!
      Hi Monks,
      I would very much appreciate any suggestions on optimizing this algorithm. Typically, the query file contains some 2 million records and the target region has about 200,000 records. In these cases, it takes several hours to assign the target status even with the quicksort algorithm. I am thinking of first dividing the target array into pivot regions of 12.5, 25, 50 75, 87.5 and 100, then checking to see if a query snp lies in a bigger chunk before passing the subset array into the quicksort function. It appears that the recursion in the function is a big drag!
      Is there a quicker way of achieving this?
      Thanks Much,
      Uma
        the query file contains some 2 million records and the target region has about 200,000 records.

        Could you post 3 query file records and 3 target region records?

        FWIW. I've a solution that matches 2e6 random integers (0 .. 1000) against 200,000 randomly generated ranges (0..700, 1..300) in 45 minutes using 3GB of ram.

        Of course, of those 2e6 queries only 1000 are unique so it's doing 2000 more work than it needs to.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: What is the best approach to check if a position falls within a target range?
by umasuresh (Hermit) on Aug 31, 2011 at 15:49 UTC
    Hi All, A neat and quick solution for this task was suggested by my colleague +and I want to share it with you all. 1. Load SNPs into a hash 2. Loop through the target region which is of the format: chr start end 3. Foreach target region, set a counter, as long as the counter is les +s than end, check if each position in the target region is in the SNP + hash, if so, print the SNP hash value. Since the SNPs is in a hash, the lookup is really quick !