in reply to Best approach for large-scale data processing

This does the job in a single pass of each file.

The basic mechanism is to use vec to build bitvectors from your regions file, where the bit is set if it is between a start/end pair for this chromosome. Then as each line of your loci file is read, the position can be looked up directly in the bitvector for that chromosome.

It requires minimal memory and is very fast:

#! perl -slw use strict; use Inline::Files; my %sticks; while( my( $chr, $start, $end ) = split ' ', <REGION> ) { vec( $sticks{ $chr } //='', $_, 1 ) = 1 for $start .. $end; } while( <LOCI> ) { chomp; my( $chr, $pos, @rest ) = split ' ', $_; print if vec( $sticks{ $chr}, $pos, 1 ); } __DATA__ __LOCI__ 1 10 1 20 1 30 1 40 1 50 1 60 1 70 1 80 1 90 1 100 2 10 2 20 2 30 2 40 2 50 2 60 2 70 2 80 2 90 2 100 __REGION__ 1 25 35 1 45 55 1 65 75 1 85 95 2 5 15 2 35 45 3 55 65 2 75 85 __OUTPUT__ C:\test>981656 1 30 1 50 1 70 1 90 2 10 2 40 2 80

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Replies are listed 'Best First'.
Re^2: Best approach for large-scale data processing
by locked_user sundialsvc4 (Abbot) on Jul 13, 2012 at 20:59 UTC

    An inspired solution, and an intro to a new feature for me.

    ++”ed

Re^2: Best approach for large-scale data processing
by iangibson (Scribe) on Jul 23, 2012 at 21:14 UTC

    I just wanted to say thanks for your help. I used this method and it works great! Here's my code in case anyone is interested:

    #!/usr/bin/perl =intro Loci File Filter This program will accept as input one file of genome loci (such as a V +CF file) and one BED file of regions, filtering the loci file to include only t +he loci in the regions specified in the BED file. The program should be invoke +d with two arguments: first the name of the BED file, then the name of the lo +ci file. Specifications: Both files should be whitespace-delimited The loci file should have at least 2 columns column 1 is the chromosome number column 2 is the coordinate of the locus any (optional) extra columns will be kept in the output file The BED file has exactly 3 columns column 1 is the chromosome number column 2 is the start coordinate the start coordinate itself is excluded i.e. output begins fr +om start coordinate + 1 column 3 is the end coordinate the end coordinate itself is included Both files may have header/comment lines beginning with a '#' characte +r These lines in the loci file are always included in the output file + in their entirety Both files may or may not have the letters 'chr' before the chromosome + number Both files should be sorted by coordinate, but chromosome order may di +ffer between files =cut # Get input filenames on program invocation: die "give the BED and loci (VCF) filenames as invocation arguments" unless @ARGV == 2; my ( $bed, $loci ) = @ARGV; # Check order of filenames given on invocation is correct: my ($ext) = $bed =~ /(\.[^.]+)$/; # get the file extension (should +be '.bed') if ( $ext ne '.bed' ) { die "please enter the BED filename *first*, followed by the loci f +ilename"; } # Get 2 input and 1 output filehandles: open my $bed_file, '<', "$bed" or die "cannot open BED file: $!\n"; open my $loci_file, '<', "$loci" or die "cannot open loci file: $!\n"; open my $out_file, '>', "$loci.filtered_loci" or die "cannot open output file: $!\n"; # Build a hash structure from the BED file. Each coordinate between th +e start and # the end coordinates is given a true value (i.e. 1); everything else +is excluded: my %sticks; while (<$bed_file>) { chomp; next if /^#/; # skip lines beginning with '#' my ( $bed_chr, $bed_start, $bed_end ) = split(/\s+/); $bed_chr =~ s/\D*(\d+)/$1/; # remove 'chr' (should it exist) foreach my $coord ( ( $bed_start + 1 ) .. $bed_end ) { vec( $sticks{$bed_chr} //= '', $coord, 1 ) = 1; } } # Run through each line of the loci file: while (<$loci_file>) { chomp; # keep lines beginning with '#': if (/^#/) { print $out_file "$_\n"; next; # go immediately to next line in the loci file } # of the remaining lines, only keep those that have a 'true' entry + in the BED file hash: my ( $loci_chr, $loci_pos ) = split(/\s+/); # ignore extra colu +mns $loci_chr =~ s/\D*(\d+)/$1/; # remove 'chr' (should it exist) if ( vec( $sticks{$loci_chr}, $loci_pos, 1 ) ) { print $out_file "$_\n"; } }