Re: What is the best approach to check if a position falls within a target range?
by Illuminatus (Curate) on Nov 10, 2010 at 19:51 UTC
|
Sorry, but I'm not very familiar with SNP. Are the values represented by 'coord1', etc fully unique, or do they repeat per chromosome (or something else)? If the coord values are unique, you might be able to trade space for time. You could create a second target region hash, key'ed by the coord's, and have the chr for each coord as its value. Then, instead of walking an array every time, you would just check if chr was in your current HOA, and the coord was in the new hash. This should be faster, if most arrays are of non-trivial lengthfnord | [reply] |
Re: What is the best approach to check if a position falls within a target range?
by Anonymous Monk on Nov 10, 2010 at 19:40 UTC
|
If step #2 is doing what it sounds like you're doing, you're making a (shallow) copy of the array, which will double the amount of memory used for that data. Instead, work with the array ref directly. This may apply for step #3 too, depending on how that's implemented.
For example, instead of
my @data = @{ $hash{chrN} }; for my $SNP (@data) { ... }
do for my $SNP (@{ $hash{chrN} }) { ... }
or my $data_ref = $hash{chrN}; for my $SNP (@$data_ref) { ... } | [reply] [d/l] [select] |
Re: What is the best approach to check if a position falls within a target range?
by umasuresh (Hermit) on Nov 12, 2010 at 18:05 UTC
|
Thanks Illuminatus and Anonymous monks for your valuable suggestions. I have been researching on sorting algorithms and got inspired to try something based on the Quicksort Algorithm explained in Mastering Algorithms with Perl. Here is what I tried which I hope will significantly reduce the target assignment. I have tried this on a test array and yet to check its performance on a huge SNP file. But here is what I have so far ...
Any feedback is greatly appreciated!
UPDATE: End points with just two elements in an array will lead to out of memory errors!!! I am fixing this issue. Will post soon... | [reply] [d/l] |
Re: What is the best approach to check if a position falls within a target range?
by umasuresh (Hermit) on Nov 15, 2010 at 22:16 UTC
|
Hi Monks,
I am posting an updated version of the code which significantly reduced the time for assigning target status (less than 1/2 the time compared to the original version).
Any input is greatly appreciated! | [reply] [d/l] |
|
|
Hi Monks,
I would very much appreciate any suggestions on optimizing this algorithm. Typically, the query file contains some 2 million records and the target region has about 200,000 records. In these cases, it takes several hours to assign the target status even with the quicksort algorithm. I am thinking of first dividing the target array into pivot regions of 12.5, 25, 50 75, 87.5 and 100, then checking to see if a query snp lies in a bigger chunk before passing the subset array into the quicksort function. It appears that the recursion in the function is a big drag! Is there a quicker way of achieving this?
Thanks Much,
Uma
| [reply] |
|
|
| [reply] |
|
|
|
|
|
|
|
FWIW. I've a solution that matches 2e6 random integers (0 .. 1000) against 200,000 randomly generated ranges (0..700, 1..300) in 45 minutes using 3GB of ram.
Of course, of those 2e6 queries only 1000 are unique so it's doing 2000 more work than it needs to.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
|
|
|
|
Re: What is the best approach to check if a position falls within a target range?
by umasuresh (Hermit) on Aug 31, 2011 at 15:49 UTC
|
Hi All,
A neat and quick solution for this task was suggested by my colleague
+and I want to share it with you all.
1. Load SNPs into a hash
2. Loop through the target region which is of the format:
chr start end
3. Foreach target region, set a counter, as long as the counter is les
+s than end, check if each position in the target region is in the SNP
+ hash, if so, print the SNP hash value.
Since the SNPs is in a hash, the lookup is really quick !
| [reply] [d/l] |