Re: unique sequences

It's impossible to tell why your actual output is wrong and your expected output is right. Please read "How do I post a question effectively?" for guidelines on what information to provide.

There doesn't appear to be anything fundamentally wrong with you regex. However, you're giving the regex engine more work to do with the character class "[ATCG]": those are the only characters that will appear in a DNA sequence, so just "." would be sufficient (and the regex engine doesn't need check if every character exists in the provided list).

With the missing information in your post — particularly the absence of input data — it's difficult to know where you've gone wrong. I suspect concatenating all the sequences is probably a bad idea; instead, considering dealing with each sequence in turn.

The following script, "pm_1205271_bio_fasta_extract.pl", is intended to provide some techniques you may find useful.

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

my ($fasta_in, $fasta_out) = qw{
    pm_1205271_bio_fasta_extract_dummy_in.fasta
    pm_1205271_bio_fasta_extract_dummy_out.fasta
};

my $re = qr[(.{10}GG)\z];

{
    open my $in_fh, '<', $fasta_in;
    open my $out_fh, '>', $fasta_out;

    local $/ = "\n>";

    while (<$in_fh>) {
        chomp;
        substr $_, 0, 1, '' if $. == 1;
        my ($head, $seq) = split /\n/;
        next unless $seq =~ $re;
        next if index($seq, $1) < length($seq) - length($1);
        print $out_fh ">$head\n$seq\n";
    }
}
[download]

Here's a sample run with some dummy data:

$ cat pm_1205271_bio_fasta_extract_dummy_in.fasta
>1: not wanted - too short
xxxGG
>2: wanted - exact match
xxxxxxxxxxGG
>3: not wanted - not unique
xxxxxxxxxxGGxxxxxxxxxxGG
>4: wanted - unique (but only just)
xxxxxxxxxGGxxxxxxxxxxGG
>5: not wanted - no GG at end
xxxxxxxxxxGGx
>6: not wanted - no match at all
xxxxxxxxxxAA
>7: wanted - match unique
yyyyyyyyyyGGxxxxxxxxxxGG
$ pm_1205271_bio_fasta_extract.pl
$ cat pm_1205271_bio_fasta_extract_dummy_out.fasta
>2: wanted - exact match
xxxxxxxxxxGG
>4: wanted - unique (but only just)
xxxxxxxxxGGxxxxxxxxxxGG
>7: wanted - match unique
yyyyyyyyyyGGxxxxxxxxxxGG
[download]

See also: autodie; perlre (you may want to follow the links at the start of that to introductory material); perlvar (for "$/" and "$."); and, Perl functions A-Z (for any functions you're unfamiliar with).

I'd also recommend you use lexical filehandles (see open); using package variables can result in bugs which are hard to track down (especially when you start writing code that's more substantial than a short assignment script).

— Ken

Comment on Re: unique sequences Select or Download Code

Replies are listed 'Best First'.
Re^2: unique sequences by Dallaylaen (Chaplain) on Dec 11, 2017 at 17:02 UTC
I'd say the best place for `[AGCT]` regex check is as close as possible to reading the input file. The less unvalidated data inside the perimeter, the better.	[reply]
Re^3: unique sequences by kcott (Archbishop) on Dec 12, 2017 at 01:35 UTC
In general, I'd agree with that. However, biological data can be huge and the least amount of processing you can get away with, the better. I'd probably recommend validating the fasta file once, then setting it to "read-only". Perhaps additional checks to ensure it hasn't changed since validation might be in order. It can then be used multiple times with some reasonable degree of confidence about the data integrity. Obviously, at this point, we don't know the source of the input, or even what it looks like, so validation requirements are purely guesswork. — Ken	[reply]