iangibson has asked for the wisdom of the Perl Monks concerning the following question:

I'd like to create a Perl program that accepts as input one file of genome loci (potentially tens of GB in size) and one region file (typically around 1 GB), and I'm trying to work out the best approach to filter the loci file based on the coordinates in the region file.

The loci file looks like this (the lines are truncated here; the important columns are the first two, chromosome and position, although I'd keep the whole line for each match):

#CHROM POS ID REF ALT QUAL FILTER IN
1 874816 . C CT 10 FINAL_AMBG
1 1647893 . C CTTTCTT 30 FINAL_NOT_
1 7889972 rs57875989 GAGAATCCATCCCATCCTACTGCCAG
1 14106394 . A ACTC 100 FI
1 22332156 . AGG A 10 FI
1 22332161 . T TC 0 FI

The region file looks like this:

1 69091 69290
1 69291 69490
1 69491 69690
1 69691 70008
1 861321 861393
1 865535 865716
1 866418 866469

The first column is the chromosome, the second column is the start coordinate, and the third column is the end coordinate.

What I'd like to do is only keep the lines from the first (loci) file whose 'POS' is between the start and end coordinates on the second (region) file.

A nested while loop would seem to be grossly inefficient, so I was thinking of building a hash of arrays. But before I do, I'd like to draw on the wisdom of the Monks on whether there's a better approach. Execution speed and memory efficiency are paramount. Suggestions much appreciated.
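For concreteness, the hash-of-arrays idea the poster describes might look something like this sketch (filenames are placeholders, and the linear inner scan is exactly the cost being worried about):

use strict;
use warnings;

# Hash of arrays: chromosome => list of [start, end] pairs.
my %regions;
open my $reg, '<', 'regions.txt' or die "regions: $!";
while (<$reg>) {
    my ( $chr, $start, $end ) = split ' ';
    push @{ $regions{$chr} }, [ $start, $end ];
}

open my $loci, '<', 'loci.txt' or die "loci: $!";
while ( my $line = <$loci> ) {
    my ( $chr, $pos ) = split ' ', $line;
    next unless my $list = $regions{$chr};
    # Linear scan per locus; a binary search over start-sorted pairs
    # would be the obvious refinement for large region files.
    for my $pair (@$list) {
        if ( $pos >= $pair->[0] && $pos <= $pair->[1] ) {
            print $line;
            last;
        }
    }
}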

Replies are listed 'Best First'.
Re: Best approach for large-scale data processing
by BrowserUk (Patriarch) on Jul 13, 2012 at 19:29 UTC

    This does the job in a single pass of each file.

    The basic mechanism is to use vec to build bit vectors from your regions file, where a position's bit is set if it falls between a start/end pair for that chromosome. Then, as each line of your loci file is read, the position can be looked up directly in the bit vector for that chromosome.
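    In isolation, the vec trick looks like this (a minimal sketch with made-up positions, not from the original post):

    my $bits = '';                     # one bit per position: ~max_pos/8 bytes
    vec( $bits, $_, 1 ) = 1 for 25 .. 35;          # mark region 25-35
    print vec( $bits, 30, 1 ) ? "in\n" : "out\n";  # prints "in"
    print vec( $bits, 40, 1 ) ? "in\n" : "out\n";  # prints "out"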

    It requires minimal memory and is very fast:

    #! perl -slw
    use strict;
    use Inline::Files;

    my %sticks;
    while( my( $chr, $start, $end ) = split ' ', <REGION> ) {
        vec( $sticks{ $chr } //= '', $_, 1 ) = 1 for $start .. $end;
    }

    while( <LOCI> ) {
        chomp;
        my( $chr, $pos, @rest ) = split ' ', $_;
        print if vec( $sticks{ $chr }, $pos, 1 );
    }

    __DATA__
    __LOCI__
    1 10
    1 20
    1 30
    1 40
    1 50
    1 60
    1 70
    1 80
    1 90
    1 100
    2 10
    2 20
    2 30
    2 40
    2 50
    2 60
    2 70
    2 80
    2 90
    2 100
    __REGION__
    1 25 35
    1 45 55
    1 65 75
    1 85 95
    2 5 15
    2 35 45
    3 55 65
    2 75 85
    __OUTPUT__
    C:\test>981656
    1 30
    1 50
    1 70
    1 90
    2 10
    2 40
    2 80


      An inspired solution, and an intro to a new feature for me.

      ++'ed

      I just wanted to say thanks for your help. I used this method and it works great! Here's my code in case anyone is interested:

      #!/usr/bin/perl

      =intro

      Loci File Filter

      This program will accept as input one file of genome loci (such as a VCF file)
      and one BED file of regions, filtering the loci file to include only the loci
      in the regions specified in the BED file. The program should be invoked with
      two arguments: first the name of the BED file, then the name of the loci file.

      Specifications:

      Both files should be whitespace-delimited
      The loci file should have at least 2 columns
          column 1 is the chromosome number
          column 2 is the coordinate of the locus
          any (optional) extra columns will be kept in the output file
      The BED file has exactly 3 columns
          column 1 is the chromosome number
          column 2 is the start coordinate
              the start coordinate itself is excluded, i.e. output begins from start coordinate + 1
          column 3 is the end coordinate
              the end coordinate itself is included
      Both files may have header/comment lines beginning with a '#' character
          These lines in the loci file are always included in the output file in their entirety
      Both files may or may not have the letters 'chr' before the chromosome number
      Both files should be sorted by coordinate, but chromosome order may differ between files

      =cut

      # Get input filenames on program invocation:
      die "give the BED and loci (VCF) filenames as invocation arguments"
          unless @ARGV == 2;
      my ( $bed, $loci ) = @ARGV;

      # Check order of filenames given on invocation is correct:
      my ($ext) = $bed =~ /(\.[^.]+)$/;    # get the file extension (should be '.bed')
      if ( $ext ne '.bed' ) {
          die "please enter the BED filename *first*, followed by the loci filename";
      }

      # Get 2 input and 1 output filehandles:
      open my $bed_file,  '<', $bed  or die "cannot open BED file: $!\n";
      open my $loci_file, '<', $loci or die "cannot open loci file: $!\n";
      open my $out_file,  '>', "$loci.filtered_loci"
          or die "cannot open output file: $!\n";

      # Build a hash structure from the BED file. Each coordinate between the start
      # and the end coordinates is given a true value (i.e. 1); everything else is
      # excluded:
      my %sticks;
      while (<$bed_file>) {
          chomp;
          next if /^#/;    # skip lines beginning with '#'
          my ( $bed_chr, $bed_start, $bed_end ) = split(/\s+/);
          $bed_chr =~ s/\D*(\d+)/$1/;    # remove 'chr' (should it exist)
          foreach my $coord ( ( $bed_start + 1 ) .. $bed_end ) {
              vec( $sticks{$bed_chr} //= '', $coord, 1 ) = 1;
          }
      }

      # Run through each line of the loci file:
      while (<$loci_file>) {
          chomp;

          # keep lines beginning with '#':
          if (/^#/) {
              print $out_file "$_\n";
              next;    # go immediately to next line in the loci file
          }

          # of the remaining lines, only keep those that have a 'true' entry in the
          # BED file hash:
          my ( $loci_chr, $loci_pos ) = split(/\s+/);    # ignore extra columns
          $loci_chr =~ s/\D*(\d+)/$1/;    # remove 'chr' (should it exist)
          if ( vec( $sticks{$loci_chr}, $loci_pos, 1 ) ) {
              print $out_file "$_\n";
          }
      }
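      For completeness, a hypothetical invocation (the script name and filenames are examples only, not from the thread): saving the above as filter_loci.pl and running

      perl filter_loci.pl regions.bed variants.vcf

      would write the kept lines to variants.vcf.filtered_loci, per the open of $out_file above.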
Re: Best approach for large-scale data processing
by davido (Cardinal) on Jul 13, 2012 at 17:24 UTC

    Is it prohibitively expensive to pull your data set into a database and do SQL queries on it? Your 'loci' file could probably be placed directly into a table with no additions. Your region file would probably need an ID column that is unique and auto-incrementing. You could then "SELECT min_pos, max_pos FROM region WHERE id=0000001;". Next, "SELECT * FROM loci WHERE pos > ? AND pos < ?;", and execute with your bind values of $min_pos and $max_pos.

    I usually get SQL wrong on the first try, so you would need to craft your own. But this type of approach gives you the full power of a relational database. The biggest problem would be the time it takes to populate the database. That may be an issue.

    Through the magic of DBI you can still enjoy the pleasure of using Perl in your solution. And through the deep magic of DBIx::Class and DBIx::Class::Schema::Loader, you could even get away with not having to worry about the SQL.
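    To make that concrete, here is a minimal sketch of the DBI route with SQLite; the table and column names (loci, region, chrom, pos, start_pos, end_pos) are assumptions, and the single JOIN condenses the two-query approach described above:

    use strict;
    use warnings;
    use DBI;

    # Assumes the two files have already been loaded into tables
    # 'loci' and 'region' (the slow part, as noted above).
    my $dbh = DBI->connect( 'dbi:SQLite:dbname=genome.db', '', '',
        { RaiseError => 1 } );

    my $sth = $dbh->prepare(q{
        SELECT l.*
        FROM   loci   l
        JOIN   region r ON  l.chrom = r.chrom
                        AND l.pos  >  r.start_pos
                        AND l.pos  <= r.end_pos
    });
    $sth->execute;
    while ( my @row = $sth->fetchrow_array ) {
        print join( "\t", @row ), "\n";
    }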

    If the dataset is short-lived enough that it proves to be prohibitively expensive to pull it all into a database, this entire post becomes just a filler between your question and the next post that provides a 100% Perl approach. ;)


    Dave

Re: Best approach for large-scale data processing
by frozenwithjoy (Priest) on Jul 13, 2012 at 17:49 UTC
    Your aversion to a nested loop for this (together with the genomics question) makes it sound like you might use R, where nested loops are laughed at because of their inefficiency relative to applying functions across an entire array, matrix, etc. If that is the case, you could use Statistics::R to interface with R easily, since what you want to do is trivial in R.
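    A minimal sketch of the Statistics::R route (filenames and the R logic are illustrative assumptions, not from this post):

    use strict;
    use warnings;
    use Statistics::R;

    my $R = Statistics::R->new();
    $R->run(q{
        loci    <- read.table("loci.txt")     # V1 = chrom, V2 = pos
        regions <- read.table("regions.txt")  # V1 = chrom, V2/V3 = start/end
        # apply a test across all loci rather than nesting explicit loops
        keep <- sapply(seq_len(nrow(loci)), function(i) {
            r <- regions[regions$V1 == loci$V1[i], ]
            any(loci$V2[i] > r$V2 & loci$V2[i] <= r$V3)
        })
        write.table(loci[keep, ], "filtered.txt",
                    row.names = FALSE, col.names = FALSE)
    });
    $R->stop();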

    That being said, how long does it actually take you to do what you want with a nested loop? You could try incorporating map, which may give you modest savings versus a for loop.
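    For what it's worth, the map/grep flavour of that nested loop might look like the following sketch (in-memory data is assumed for illustration; it is still O(loci x regions), so any savings over an explicit for loop really would be modest):

    use strict;
    use warnings;

    my @regions = ( [ 1, 25, 35 ], [ 1, 45, 55 ] );   # [chrom, start, end]
    my @loci    = ( "1 30 . C T", "1 40 . G A" );     # whole input lines

    # grep is map's filtering cousin: keep a locus if any region matches.
    my @kept = grep {
        my ( $chr, $pos ) = split ' ';
        grep { $chr eq $_->[0] && $pos > $_->[1] && $pos <= $_->[2] } @regions;
    } @loci;

    print "$_\n" for @kept;    # prints "1 30 . C T"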