Loading only portions of a file at any given time

TJCooper has asked for the wisdom of the Perl Monks concerning the following question:

I currently load a FASTA formatted file, which contains DNA sequence separated by headers denoted by '>' i.e.:

>header1
AGACATGATCGACTGGACACAATTTACGTAGCTG
>header2
AAATACTAGGGCAACACACACACACACCACACAC
>header3
AGTTAGTCCAGTAGTTTACAGTACGACTGATCGT
[download]

via bioPerl (Bio::SeqIO), and store the entire FASTA file into memory as a hash:

my $fasta = $ARGV[0];
my $seqio = Bio::SeqIO->new(-file => $fasta);
while(my $seqobj = $seqio->next_seq) {
    my $id  = $seqobj->display_id;
    my $seq = $seqobj->seq;
    $sequences{$id} = $seq;
}
open my $IN2, '<', $ARGV[1] or die "$!";
while (<$IN2>) {
    # Subsample the FASTA sequences
}
[download]

Or I could code it manually (which is likely quicker given the amount of overhead in Bio::SeqIO).

Based on a second input file, I then subsample the FASTA sequences and do some processing.

Each line of $IN2 includes the header from its associated FASTA sequence.

This works fine, and on modern computers the memory usage isn't too much of a problem even with larger genomes (Human etc.). Nevertheless, i'd like to take advantage of the fact that my second input file ($IN2) is always sorted by header and thus, for example, the first 5000 lines of it only need the FASTA sequence from >header1. This could potentially save me having to load the entire FASTA file into memory and instead only load the sequence that is currently being used.

I'm however having a tough time trying to figure out how to actually do this so any suggestions or code would be greatly appreciated.

Comment on Loading only portions of a file at any given time Select or Download Code

Replies are listed 'Best First'.
Re: Loading only portions of a file at any given time by 1nickt (Canon) on Nov 30, 2017 at 13:38 UTC
Hi, if you won't know which header section you need until you start parsing the second file, I'd think you'd have a hard time doing what you want efficiently on the fly. Can you simply break your FASTA file into separate files in advance, one per header? The way forward always starts with a minimal test.	[reply]
Re^2: Loading only portions of a file at any given time by TJCooper (Beadle) on Nov 30, 2017 at 13:43 UTC
Indeed, I found that to be the main issue when trying to think of a way to do this. I could definitely split the FASTA file beforehand, although I try to minimise the amount of steps required to run a script hence the question! Thanks for your input.	[reply]
Re: Loading only portions of a file at any given time by kcott (Archbishop) on Dec 01, 2017 at 05:54 UTC
G'day TJCooper, I've provided an example script below which shows some techniques that you might use for this. I've assumed your `IN2` file just contains headers (your description wasn't 100% clear on that). If you're going to be reusing `IN2`, you may want to consider running the first part of the script separately, and storing the `@all_order` data: a database, Storable, or other serialisation methods are all possible candidates (depends on your requirements and what you have available). #!/usr/bin/env perl use strict; use warnings; use autodie; use Data::Dump; my @all_order; { open my $in2_fh, '<', 'pm_1204589_IN2'; while (<$in2_fh>) { chomp; push @all_order, $_; } } for my $range ([0..0], [0..2], [1..3], [3..3]) { my %wanted = map { $_ => 1 } @all_order[@$range]; { open my $in1_fh, '<', 'pm_1204589_IN1'; local $/ = "\n>"; while (<$in1_fh>) { chomp; substr $_, 0, 1, '' if $. == 1; my ($head, $seq) = split; $wanted{$head} = $seq if $wanted{$head}; } } print "Range: @$range\n"; dd \%wanted; } [download] With these dummy files (yes, I know, the sequences are totally bogus — just used for demonstration purposes): `$ cat pm_1204589_IN1 >Z ABC >Y DEF >X GHI >W JKL` [download] `$ cat pm_1204589_IN2 W X Y Z` [download] I get this output: `Range: 0 { W => "JKL" } Range: 0 1 2 { W => "JKL", X => "GHI", Y => "DEF" } Range: 1 2 3 { X => "GHI", Y => "DEF", Z => "ABC" } Range: 3 { Z => "ABC" }` [download] See also: Data::Dump. — Ken	[reply] [d/l] [select]
Re: Loading only portions of a file at any given time by duff (Parson) on Nov 30, 2017 at 15:31 UTC
What TKCooper said, but ... you could also do something more complex like build an "index" and store it in another file and then use that index to determine which parts to load when. But, the cost of that method (in time and complexity and developing the thing that reads based on index and in maintenance) may out-weigh the benefit you gain. Though, if this is a commonish thing to want, maybe everyone who uses FASTA could benefit from a module that does this indexing? I dunno. duff	[reply]
Re^2: Loading only portions of a file at any given time by Anonymous Monk on Nov 30, 2017 at 16:18 UTC
Actually, a compromise suggestion might not be so bad ... Since the file is sorted, one would only need to read the file sequentially once, and note the binary `seek()` position where each sequence begins. (It ends where the next one begins.) You might even be able to keep that data in memory. Or, write it to a separate file that each script loads. Each script could now seek directly to the right spot, read the specified number of bytes, and do a couple of quick debugging-checks to make sure that what it just read looks okay. (i.e. that the index-file is not out of date.) Simple and easy to do, and it just might save a lot of time. (If the region is very-large, consider also File::Map.)	[reply]