Peter Keystrokes has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have a script which hashes the sequences of a pretty hefty file whereby the hash is the DNA sequence and the key is the sequence ID. I am now trying to cycle through my hashes using a foreach loop in an attempt to sift out sequences which have a specific length. This length will be set using a variable for minimum and maximum length at the user's discretion. I am having problems trying to accomplish this task though. I'm not sure how to return the character length of the hash. In order to pick my desired sequences. My script so far is as follows:
open F, "human_hg19_circRNAs_putative_spliced_sequence.fa", or die $!; my %seq = (); my $id = ''; while (<F>){ chomp; if ($_ =~ /^>(.+)/){ $_ = $1; }else{ $seq{$id} .= $_; } } my @seqarray; my $maxlength = 150; my $minlength = 100; foreach $id (keys %seq){ if ((length$seq{$id} <= $maxlength) || (length$seq{$id} >= $minlen +gth)){ @seqarray = ($id."\n",$seq{$id}); } } print join ("\n", @seqarray); close F;

Replies are listed 'Best First'.
Re: Picking out specific lengths from a set of hashes.
by huck (Prior) on May 11, 2017 at 18:36 UTC

    In doing @seqarray = ($id."\n",$seq{$id}); you are resetting seqarray to a single member no matter what is in there already. I suspect you want to push something onto seqarray. I would push $id, then later use that to get at the contents of $seq{$id} as well.

    foreach $id (keys %seq){ if ((length ($seq{$id}) <= $maxlength) || (length($seq{$id}) >= $m +inlength)){ push @seqarray,$id; } } for my $id (@seqarray){ print $id."\n".$seq{$id}."\n"; }

      Thank you. What's the reason for redeclaring the $id variable in the final 'for' loop? Edit: I'm also wondering whether the length function works on a hash such as $seq{$id} in the same way it works on say something like $id?

        What's the reason for redeclaring the $id variable in the final 'for' loop?

        Habit, keeping things local to where they are used rather than global in scope.

        I'm also wondering whether the length function works on a hash such as $seq{$id} in the same way it works on say something like $id?

        to length either is just yet another string

        I'm also wondering whether the length function works on a hash such as $seq{$id} in the same way it works on say something like $id?

        $seq{$id} is not a hash, but a value, and it's length can be determined, as with any string or number or byte sequence. In Perl speak, the word hash denotes the whole structure of an associative array holding key/value pairs. A hash is minced meat, and that's what's done to the key of a tuple: it is transformed via a hashing function into something that can be used as index into an associative array. So, a perl "hash" is an "unordered collection of scalar values indexed by their associated string key (perldata) which for lookup purposes is minced through a hashing function". For entries (key/value pairs) in such a collection (hash entry) the string key is often referred to as hash key and the value as hash value.

        This may sound pedantic and as nitpicking (sorry about that), but people understand each other better if they agree on the meaning of terms.

        perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
      I think && instead of || is needed here? To be within the range from min to max, number must be <= max and >= min.
Re: Picking out specific lengths from a set of hashes.
by kcott (Archbishop) on May 11, 2017 at 22:45 UTC

    G'day Peter Keystrokes,

    It occurred to me that given you said "a pretty hefty file", and, in terms of biological data, that could easily mean tens or hundreds of megabytes, you may not want to keep adding data to an ID once the maximum size has been exceeded. There's no point in using up large amounts of memory when you don't need to nor performing pointless processing.

    In the following script (pm_1190090_fasta_filter_length.pl), when an ID's value exceeds the maximum length, it is flagged (by setting its value to the empty string — which also gets rid of whatever unwanted data it previously held) and then completely ignored on all subsequent iterations. After the file has been completely read, all values less than the minimum length are removed in a single statement.

    #!/usr/bin/env perl -l use strict; use warnings; use autodie; my ($min, $max) = (10, 15); my $re = qr{(?x: ^ > ( .+ ) $ )}; my (%seq, $id); { open my $fh, '<', 'pm_1190090.fa'; while (<$fh>) { if (/$re/) { $id = $1; } else { next if exists $seq{$id} and $seq{$id} eq ''; chomp; $seq{$id} .= $_; $seq{$id} = '' if length $seq{$id} > $max; } } } delete @seq{grep { length $seq{$_} < $min } keys %seq}; print "$_: $seq{$_}" for sort keys %seq;

    I dummied up this test data (pm_1190090.fa):

    >1 AAA >5 AAA >2 AAA >1 AAACCC >5 CCC >3 AAA >1 AAACCCGGG >3 CCCGGG >1 AAACCCGGGTTT >4 AAACCCGGGTTT >3 TTT >6 12345678901234 >7 123456789012345 >8 1234567890123456

    Here's a quick breakdown of the IDs. The first five are intended to look like real data; the last three are not, their purpose is to test for off-by-one errors.

    1. Occurs 4 times. Each instance is less than the minimum. Total length: 30. EXCLUDE
    2. Occurs once. Total length: 3. EXCLUDE
    3. Occurs 3 times. Each instance is less than the minimum. Total length: 12. INCLUDE
    4. Occurs once. Total length: 12. INCLUDE
    5. Occurs 2 times. Each instance is less than the minimum. Total length: 6. EXCLUDE
    6. Occurs once. Total length: 14 (one less than max). INCLUDE
    7. Occurs once. Total length: 15 (equal to max). INCLUDE
    8. Occurs once. Total length: 16 (one more than max). EXCLUDE

    Here's a sample run:

    $ pm_1190090_fasta_filter_length.pl 3: AAACCCGGGTTT 4: AAACCCGGGTTT 6: 12345678901234 7: 123456789012345 $

    Some additional notes:

    • Using a globally-scoped, package variable for a filehandle is a bad choice: it has all the same problems as any global variable. Use a lexical filehandle with the 3-argument form of open. Doing this in limited scope (as I did) means you don't have to remember to close it (Perl does that for you). Leaving filehandles open until the last line of your script is also a bad choice: close them when you're finished with them.
    • Consider using the autodie pragma instead of hand-crafting you own I/O error messages (which are tedious to do and, because they're easy to forget and easy to get wrong or incomplete, error-prone).
    • Take a look at the sort documentation. Depending on the values of your IDs, you may need something more than I've shown; e.g. "sort { $a <=> $b } ..." for an ascending, numerical sort (when values don't all have the same number of digits).
    • See delete for examples of using a hash slice. See "perldata: Slices" if you don't know what that is.

    Update (additional tests): After posting, I realised that I'd tested for off-by-one errors around "max" but not around "min". I added these lines to the input file:

    >9 123456789 >10 1234567890 >11 12345678901

    And correctly got these two additional lines in the output:

    10: 1234567890 11: 12345678901

    — Ken

Re: Picking out specific lengths from a set of hashes.
by dave_the_m (Monsignor) on May 11, 2017 at 20:00 UTC
    When you read in the file, you never set an id. I suspect that $_ = $1 should in fact be $id = $1.

    Dave.

Re: Picking out specific lengths from a set of hashes.
by marinersk (Priest) on May 12, 2017 at 06:13 UTC

    Also, just a quick note on terminology:

    my %myhash = (); $myhash{'ABC'} = 10;
    • key: The 'ABC'portion of $myhash{'ABC'}
    • value: The value stored in the hash by that key; in this case, 10
    • hash: The whole structure %myhash

    When you use the term 'hash' to mean the 'value', it can cause confusion when someone is reading your description. I find it helpful to use distinct terminology for each element or aspect of a thing, if possible.