comment on

I just wanted to say thanks for your help. I used this method and it works great! Here's my code in case anyone is interested:

#!/usr/bin/perl

=intro
Loci File Filter

This program will accept as input one file of genome loci (such as a V
+CF file)
and one BED file of regions, filtering the loci file to include only t
+he loci
in the regions specified in the BED file. The program should be invoke
+d with
two arguments: first the name of the BED file, then the name of the lo
+ci file.

Specifications:
Both files should be whitespace-delimited 
   The loci file should have at least 2 columns
      column 1 is the chromosome number
      column 2 is the coordinate of the locus
         any (optional) extra columns will be kept in the output file
   The BED file has exactly 3 columns
      column 1 is the chromosome number
      column 2 is the start coordinate
         the start coordinate itself is excluded i.e. output begins fr
+om start coordinate + 1
      column 3 is the end coordinate
         the end coordinate itself is included
Both files may have header/comment lines beginning with a '#' characte
+r
   These lines in the loci file are always included in the output file
+ in their entirety
Both files may or may not have the letters 'chr' before the chromosome
+ number
Both files should be sorted by coordinate, but chromosome order may di
+ffer between files
=cut

# Get input filenames on program invocation:
die "give the BED and loci (VCF) filenames as invocation arguments"
  unless @ARGV == 2;
my ( $bed, $loci ) = @ARGV;

# Check order of filenames given on invocation is correct:
my ($ext) = $bed =~ /(\.[^.]+)$/;    # get the file extension (should 
+be '.bed')
if ( $ext ne '.bed' ) {
    die "please enter the BED filename *first*, followed by the loci f
+ilename";
}

# Get 2 input and 1 output filehandles:
open my $bed_file,  '<', "$bed"  or die "cannot open BED file: $!\n";
open my $loci_file, '<', "$loci" or die "cannot open loci file: $!\n";
open my $out_file, '>', "$loci.filtered_loci"
  or die "cannot open output file: $!\n";

# Build a hash structure from the BED file. Each coordinate between th
+e start and
# the end coordinates is given a true value (i.e. 1); everything else 
+is excluded:
my %sticks;
while (<$bed_file>) {
    chomp;
    next if /^#/;   # skip lines beginning with '#'

    my ( $bed_chr, $bed_start, $bed_end ) = split(/\s+/);
    $bed_chr =~ s/\D*(\d+)/$1/;    # remove 'chr' (should it exist)

    foreach my $coord ( ( $bed_start + 1 ) .. $bed_end ) {
        vec( $sticks{$bed_chr} //= '', $coord, 1 ) = 1;
    }
}

# Run through each line of the loci file:
while (<$loci_file>) {
    chomp;

    # keep lines beginning with '#':
    if (/^#/) {
        print $out_file "$_\n";
        next;    # go immediately to next line in the loci file
    }

    # of the remaining lines, only keep those that have a 'true' entry
+ in the BED file hash:
    my ( $loci_chr, $loci_pos ) = split(/\s+/);    # ignore extra colu
+mns
    $loci_chr =~ s/\D*(\d+)/$1/;    # remove 'chr' (should it exist)

    if ( vec( $sticks{$loci_chr}, $loci_pos, 1 ) ) {
        print $out_file "$_\n";
    }
}
[download]

In reply to Re^2: Best approach for large-scale data processing by iangibson
in thread Best approach for large-scale data processing by iangibson

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.