iangibson has asked for the wisdom of the Perl Monks concerning the following question:
I'd like to create a Perl program that accepts as input one file of genome loci (potentially 10's of GB in size) and one region file (typically around 1GB), and I'm trying to think of the best approach to filter the loci file based on the coordinates in the region file.
The loci file looks like this (the lines are truncated here; the important columns are the first two, chromosome and position, although I'd keep the whole line for each match):
```
#CHROM POS ID REF ALT QUAL FILTER IN
1 874816 . C CT 10 FINAL_AMBG
1 1647893 . C CTTTCTT 30 FINAL_NOT_
1 7889972 rs57875989 GAGAATCCATCCCATCCTACTGCCAG
1 14106394 . A ACTC 100 FI
1 22332156 . AGG A 10 FI
1 22332161 . T TC 0 FI
```
The region file looks like this:
```
1 69091 69290
1 69291 69490
1 69491 69690
1 69691 70008
1 861321 861393
1 865535 865716
1 866418 866469
```
The first column is the chromosome, the second column is the start coordinate, and the third column is the end coordinate.
What I'd like to do is keep only the lines from the first (loci) file whose 'POS' falls between the start and end coordinates of some region in the second (region) file.

A nested while loop would seem grossly inefficient, so I was thinking of building a hash of arrays. But before I do, I'd like to draw on the wisdom of the Monks as to whether there's a better approach. Execution speed and memory efficiency are paramount. Suggestions much appreciated.
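To make the hash-of-arrays idea a little more concrete, here is the sort of thing I have in mind (a rough, untested sketch; the file names are placeholders, and it assumes the regions on a chromosome don't overlap): load the smaller region file into a hash keyed by chromosome, each value a start-sorted array of [start, end] pairs, then stream the huge loci file once and binary-search those intervals for each POS.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder file names; column layout is whitespace-separated as in the
# samples above (chromosome first, then position / start / end).
my ($region_file, $loci_file) = ('regions.txt', 'loci.vcf');

# Build a hash of arrays: chromosome => [ [start, end], ... ], sorted by start.
my %regions;
open my $rfh, '<', $region_file or die "Can't open $region_file: $!";
while (<$rfh>) {
    my ($chr, $start, $end) = split;
    push @{ $regions{$chr} }, [ $start, $end ];
}
close $rfh;
@$_ = sort { $a->[0] <=> $b->[0] } @$_ for values %regions;

# Stream the loci file and binary-search the region list for each position.
open my $lfh, '<', $loci_file or die "Can't open $loci_file: $!";
while (my $line = <$lfh>) {
    next if $line =~ /^#/;                    # skip header lines
    my ($chr, $pos) = split ' ', $line;
    my $list = $regions{$chr} or next;        # no regions on this chromosome
    my ($lo, $hi) = (0, $#$list);
    while ($lo <= $hi) {
        my $mid = int( ($lo + $hi) / 2 );
        if    ($pos < $list->[$mid][0]) { $hi = $mid - 1 }
        elsif ($pos > $list->[$mid][1]) { $lo = $mid + 1 }
        else  { print $line; last }           # POS falls inside this region
    }
}
close $lfh;
```

That way only the ~1GB region file has to sit in memory and each locus line costs a single binary search, but I don't know whether there's a smarter or more memory-frugal way to go about it.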
Replies are listed 'Best First'.
Re: Best approach for large-scale data processing
by BrowserUk (Patriarch) on Jul 13, 2012 at 19:29 UTC
by sundialsvc4 (Abbot) on Jul 13, 2012 at 20:59 UTC
by iangibson (Scribe) on Jul 23, 2012 at 21:14 UTC
Re: Best approach for large-scale data processing
by davido (Cardinal) on Jul 13, 2012 at 17:24 UTC
Re: Best approach for large-scale data processing
by frozenwithjoy (Priest) on Jul 13, 2012 at 17:49 UTC