Re: Picking out specific lengths from a set of hashes.

G'day Peter Keystrokes,

It occurred to me that given you said "a pretty hefty file", and, in terms of biological data, that could easily mean tens or hundreds of megabytes, you may not want to keep adding data to an ID once the maximum size has been exceeded. There's no point in using up large amounts of memory when you don't need to nor performing pointless processing.

In the following script (pm_1190090_fasta_filter_length.pl), when an ID's value exceeds the maximum length, it is flagged (by setting its value to the empty string — which also gets rid of whatever unwanted data it previously held) and then completely ignored on all subsequent iterations. After the file has been completely read, all values less than the minimum length are removed in a single statement.

#!/usr/bin/env perl -l

use strict;
use warnings;
use autodie;

my ($min, $max) = (10, 15);
my $re = qr{(?x: ^ > ( .+ ) $ )};
my (%seq, $id);

{   
    open my $fh, '<', 'pm_1190090.fa';

    while (<$fh>) {
        if (/$re/) {
            $id = $1; 
        }   
        else {
            next if exists $seq{$id} and $seq{$id} eq '';
            chomp;
            $seq{$id} .= $_;
            $seq{$id} = '' if length $seq{$id} > $max;
        }   
    }   
}   

delete @seq{grep { length $seq{$_} < $min } keys %seq};
    
print "$_: $seq{$_}" for sort keys %seq;
[download]

I dummied up this test data (pm_1190090.fa):

>1
AAA
>5
AAA
>2
AAA
>1
AAACCC
>5
CCC
>3
AAA
>1
AAACCCGGG
>3
CCCGGG
>1
AAACCCGGGTTT
>4
AAACCCGGGTTT
>3
TTT
>6
12345678901234
>7
123456789012345
>8
1234567890123456
[download]

Here's a quick breakdown of the IDs. The first five are intended to look like real data; the last three are not, their purpose is to test for off-by-one errors.

Occurs 4 times. Each instance is less than the minimum. Total length: 30. EXCLUDE
Occurs once. Total length: 3. EXCLUDE
Occurs 3 times. Each instance is less than the minimum. Total length: 12. INCLUDE
Occurs once. Total length: 12. INCLUDE
Occurs 2 times. Each instance is less than the minimum. Total length: 6. EXCLUDE
Occurs once. Total length: 14 (one less than max). INCLUDE
Occurs once. Total length: 15 (equal to max). INCLUDE
Occurs once. Total length: 16 (one more than max). EXCLUDE

Here's a sample run:

$ pm_1190090_fasta_filter_length.pl
3: AAACCCGGGTTT
4: AAACCCGGGTTT
6: 12345678901234
7: 123456789012345
$
[download]

Some additional notes:

Using a globally-scoped, package variable for a filehandle is a bad choice: it has all the same problems as any global variable. Use a lexical filehandle with the 3-argument form of open. Doing this in limited scope (as I did) means you don't have to remember to close it (Perl does that for you). Leaving filehandles open until the last line of your script is also a bad choice: close them when you're finished with them.
Consider using the autodie pragma instead of hand-crafting you own I/O error messages (which are tedious to do and, because they're easy to forget and easy to get wrong or incomplete, error-prone).
Take a look at the sort documentation. Depending on the values of your IDs, you may need something more than I've shown; e.g. "sort { $a <=> $b } ..." for an ascending, numerical sort (when values don't all have the same number of digits).
See delete for examples of using a hash slice. See "perldata: Slices" if you don't know what that is.

Update (additional tests): After posting, I realised that I'd tested for off-by-one errors around "max" but not around "min". I added these lines to the input file:

>9
123456789
>10
1234567890
>11
12345678901
[download]

And correctly got these two additional lines in the output:

10: 1234567890
11: 12345678901
[download]

— Ken

Comment on Re: Picking out specific lengths from a set of hashes. Select or Download Code