This does the job in a single pass over each file.
The basic mechanism is to use vec to build one bitvector per chromosome from your regions file, with a bit set for every position that falls within a start/end pair (inclusive) for that chromosome. Then, as each line of your loci file is read, its position is looked up directly in the bitvector for its chromosome.
It requires minimal memory (one bit per position, up to the largest region end seen on each chromosome) and is very fast:
#! perl -slw
use strict;
use Inline::Files;

my %sticks;

while( my( $chr, $start, $end ) = split ' ', <REGION> ) {
    vec( $sticks{ $chr } //= '', $_, 1 ) = 1 for $start .. $end;
}

while( <LOCI> ) {
    chomp;
    my( $chr, $pos, @rest ) = split ' ', $_;
    print if vec( $sticks{ $chr }, $pos, 1 );
}

__DATA__
__LOCI__
1 10
1 20
1 30
1 40
1 50
1 60
1 70
1 80
1 90
1 100
2 10
2 20
2 30
2 40
2 50
2 60
2 70
2 80
2 90
2 100
__REGION__
1 25 35
1 45 55
1 65 75
1 85 95
2 5 15
2 35 45
3 55 65
2 75 85
__OUTPUT__
C:\test>981656
1 30
1 50
1 70
1 90
2 10
2 40
2 80
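If you don't have Inline::Files installed (it is a CPAN module, not core), here is a minimal core-only sketch of the same lookup. The sub names build_bitvectors and in_region, and the small in-memory sample data, are made up for illustration; in real use you would feed in the lines read from your regions and loci files instead.

```perl
use strict;
use warnings;

# Build one bit string per chromosome; bit N is set when position N
# lies inside any region (start and end inclusive) on that chromosome.
# Each argument is an array ref: [ chr, start, end ].
sub build_bitvectors {
    my @regions = @_;
    my %bits;
    for my $r ( @regions ) {
        my( $chr, $start, $end ) = @$r;
        $bits{ $chr } //= '';
        vec( $bits{ $chr }, $_, 1 ) = 1 for $start .. $end;
    }
    return \%bits;
}

# Returns 1 if the position falls inside a region on that chromosome,
# 0 otherwise (including chromosomes that have no regions at all).
sub in_region {
    my( $bits, $chr, $pos ) = @_;
    return 0 unless exists $bits->{ $chr };
    return vec( $bits->{ $chr }, $pos, 1 );
}

my $bits = build_bitvectors( [ 1, 25, 35 ], [ 1, 45, 55 ], [ 2, 5, 15 ] );

for my $locus ( [ 1, 30 ], [ 1, 40 ], [ 2, 10 ], [ 3, 60 ] ) {
    my( $chr, $pos ) = @$locus;
    print "$chr $pos\n" if in_region( $bits, $chr, $pos );
}
# prints:
# 1 30
# 2 10
```

The exists check is only there to avoid an uninitialized-value warning when a locus names a chromosome that never appeared in the regions data. The bit string for chromosome 1 above is 7 bytes long (bits 0 through 55), which shows where the memory goes: it grows with the largest end position, not with the number of regions.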
In reply to Re: Best approach for large-scale data processing by BrowserUk, in thread Best approach for large-scale data processing by iangibson