I just wanted to say thanks for your help. I used this method and it works great! Here's my code in case anyone is interested:
#!/usr/bin/perl
=intro
Loci File Filter
This program will accept as input one file of genome loci (such as a V
+CF file)
and one BED file of regions, filtering the loci file to include only t
+he loci
in the regions specified in the BED file. The program should be invoke
+d with
two arguments: first the name of the BED file, then the name of the lo
+ci file.
Specifications:
Both files should be whitespace-delimited
The loci file should have at least 2 columns
column 1 is the chromosome number
column 2 is the coordinate of the locus
any (optional) extra columns will be kept in the output file
The BED file has exactly 3 columns
column 1 is the chromosome number
column 2 is the start coordinate
the start coordinate itself is excluded i.e. output begins fr
+om start coordinate + 1
column 3 is the end coordinate
the end coordinate itself is included
Both files may have header/comment lines beginning with a '#' characte
+r
These lines in the loci file are always included in the output file
+ in their entirety
Both files may or may not have the letters 'chr' before the chromosome
+ number
Both files should be sorted by coordinate, but chromosome order may di
+ffer between files
=cut
# Get input filenames on program invocation:
die "give the BED and loci (VCF) filenames as invocation arguments"
unless @ARGV == 2;
my ( $bed, $loci ) = @ARGV;
# Check order of filenames given on invocation is correct:
my ($ext) = $bed =~ /(\.[^.]+)$/; # get the file extension (should
+be '.bed')
if ( $ext ne '.bed' ) {
die "please enter the BED filename *first*, followed by the loci f
+ilename";
}
# Get 2 input and 1 output filehandles:
open my $bed_file, '<', "$bed" or die "cannot open BED file: $!\n";
open my $loci_file, '<', "$loci" or die "cannot open loci file: $!\n";
open my $out_file, '>', "$loci.filtered_loci"
or die "cannot open output file: $!\n";
# Build a hash structure from the BED file. Each coordinate between th
+e start and
# the end coordinates is given a true value (i.e. 1); everything else
+is excluded:
my %sticks;
while (<$bed_file>) {
chomp;
next if /^#/; # skip lines beginning with '#'
my ( $bed_chr, $bed_start, $bed_end ) = split(/\s+/);
$bed_chr =~ s/\D*(\d+)/$1/; # remove 'chr' (should it exist)
foreach my $coord ( ( $bed_start + 1 ) .. $bed_end ) {
vec( $sticks{$bed_chr} //= '', $coord, 1 ) = 1;
}
}
# Run through each line of the loci file:
while (<$loci_file>) {
chomp;
# keep lines beginning with '#':
if (/^#/) {
print $out_file "$_\n";
next; # go immediately to next line in the loci file
}
# of the remaining lines, only keep those that have a 'true' entry
+ in the BED file hash:
my ( $loci_chr, $loci_pos ) = split(/\s+/); # ignore extra colu
+mns
$loci_chr =~ s/\D*(\d+)/$1/; # remove 'chr' (should it exist)
if ( vec( $sticks{$loci_chr}, $loci_pos, 1 ) ) {
print $out_file "$_\n";
}
}
|