G'day Peter Keystrokes,

It occurred to me that, given you said "a pretty hefty file" (which, in terms of biological data, could easily mean tens or hundreds of megabytes), you may not want to keep adding data to an ID once the maximum size has been exceeded. There's no point in using up large amounts of memory when you don't need to, nor in performing pointless processing.

In the following script (pm_1190090_fasta_filter_length.pl), when an ID's value exceeds the maximum length, it is flagged by setting its value to the empty string (which also gets rid of whatever unwanted data it previously held) and then completely ignored on all subsequent iterations. After the file has been completely read, all values shorter than the minimum length are removed in a single statement; this also removes the flagged IDs, because the empty string has length zero.

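#!/usr/bin/env perl -l

use strict;
use warnings;
use autodie;

my ($min, $max) = (10, 15);

# Header lines look like ">ID"; capture the ID.
my $re = qr{(?x: ^ > ( .+ ) $ )};

my (%seq, $id);

{
    open my $fh, '<', 'pm_1190090.fa';

    while (<$fh>) {
        if (/$re/) {
            $id = $1;
        }
        else {
            # Skip data for IDs already flagged as too long.
            next if exists $seq{$id} and $seq{$id} eq '';
            chomp;
            $seq{$id} .= $_;
            # Flag (and free the data) once the maximum is exceeded.
            $seq{$id} = '' if length $seq{$id} > $max;
        }
    }
}

# Remove everything shorter than the minimum, including flagged IDs.
delete @seq{grep { length $seq{$_} < $min } keys %seq};

print "$_: $seq{$_}" for sort keys %seq;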
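The single-statement removal relies on a hash slice: `delete @seq{LIST}` deletes every key in LIST at once. Here's a minimal sketch of just that idiom; the hash keys and values are made up for illustration and are not part of the test data.

```perl
use strict;
use warnings;

# Hypothetical data: 'flagged' holds the empty string, as an over-long ID would.
my %seq = (short => 'AAA', ok => 'AAACCCGGGTTT', flagged => '');
my $min = 10;

# grep builds the list of keys whose values are too short. The empty
# string has length 0, so flagged entries are caught by the same test.
delete @seq{ grep { length $seq{$_} < $min } keys %seq };

print join(',', sort keys %seq), "\n";    # prints "ok"
```

Because grep has finished building its list before delete runs, the hash is not being modified while it is iterated.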

I dummied up this test data (pm_1190090.fa):

>1
AAA
>5
AAA
>2
AAA
>1
AAACCC
>5
CCC
>3
AAA
>1
AAACCCGGG
>3
CCCGGG
>1
AAACCCGGGTTT
>4
AAACCCGGGTTT
>3
TTT
>6
12345678901234
>7
123456789012345
>8
1234567890123456

Here's a quick breakdown of the IDs. The first five are intended to look like real data; the last three are not: their purpose is to test for off-by-one errors.

  1. Occurs 4 times. Each instance is less than the maximum. Total length: 30. EXCLUDE
  2. Occurs once. Total length: 3. EXCLUDE
  3. Occurs 3 times. Each instance is less than the minimum. Total length: 12. INCLUDE
  4. Occurs once. Total length: 12. INCLUDE
  5. Occurs 2 times. Each instance is less than the minimum. Total length: 6. EXCLUDE
  6. Occurs once. Total length: 14 (one less than max). INCLUDE
  7. Occurs once. Total length: 15 (equal to max). INCLUDE
  8. Occurs once. Total length: 16 (one more than max). EXCLUDE
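
The breakdown above can be checked without a data file. This is only a sketch: the sub name `filter_by_length` is my own invention for illustration, and the test data is fed through an in-memory filehandle rather than read from pm_1190090.fa. The logic is the same flag-then-delete approach, plus a guard for data appearing before any header.

```perl
use strict;
use warnings;

# Hypothetical helper: same flag-then-delete logic, reading from any filehandle.
sub filter_by_length {
    my ($fh, $min, $max) = @_;
    my (%seq, $id);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>(.+)$/) {
            $id = $1;
            next;
        }
        next unless defined $id;                          # data before any header
        next if exists $seq{$id} and $seq{$id} eq '';     # already flagged
        $seq{$id} .= $line;
        $seq{$id} = '' if length $seq{$id} > $max;        # flag as too long
    }
    delete @seq{ grep { length $seq{$_} < $min } keys %seq };
    return \%seq;
}

# The same 14 records as the test file, one token per line.
my $fasta = join "\n", qw{
    >1 AAA >5 AAA >2 AAA >1 AAACCC >5 CCC >3 AAA >1 AAACCCGGG
    >3 CCCGGG >1 AAACCCGGGTTT >4 AAACCCGGGTTT >3 TTT
    >6 12345678901234 >7 123456789012345 >8 1234567890123456
}, '';

open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
my $kept = filter_by_length($fh, 10, 15);
close $fh;

print join(' ', sort keys %$kept), "\n";    # prints "3 4 6 7"
```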

Here's a sample run:

$ pm_1190090_fasta_filter_length.pl
3: AAACCCGGGTTT
4: AAACCCGGGTTT
6: 12345678901234
7: 123456789012345
$

Some additional notes:

Update (additional tests): After posting, I realised that I'd tested for off-by-one errors around "max" but not around "min". I added these lines to the input file:

>9
123456789
>10
1234567890
>11
12345678901

And correctly got these two additional lines in the output:

10: 1234567890
11: 12345678901
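
Both boundaries are inclusive: the script flags only lengths greater than max and deletes only lengths less than min. As a rough sketch of that rule in isolation (the `%verdict` hash is purely illustrative), here are the six boundary lengths used by IDs 6 to 11:

```perl
use strict;
use warnings;

my ($min, $max) = (10, 15);

# Lengths on either side of both boundaries.
my @lengths = (9, 10, 11, 14, 15, 16);

# A length is kept only if it lies in the inclusive range [$min, $max].
my %verdict = map {
    $_ => ($_ >= $min && $_ <= $max ? 'INCLUDE' : 'EXCLUDE')
} @lengths;

print "$_: $verdict{$_}\n" for @lengths;
```

Only 9 and 16 come out as EXCLUDE, which matches the output for IDs 6 to 11.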

— Ken


In reply to Re: Picking out specific lengths from a set of hashes. by kcott
in thread Picking out specific lengths from a set of hashes. by Peter Keystrokes
