comment on

It's impossible to tell why your actual output is wrong and your expected output is right. Please read "How do I post a question effectively?" for guidelines on what information to provide.

There doesn't appear to be anything fundamentally wrong with you regex. However, you're giving the regex engine more work to do with the character class "[ATCG]": those are the only characters that will appear in a DNA sequence, so just "." would be sufficient (and the regex engine doesn't need check if every character exists in the provided list).

With the missing information in your post — particularly the absence of input data — it's difficult to know where you've gone wrong. I suspect concatenating all the sequences is probably a bad idea; instead, considering dealing with each sequence in turn.

The following script, "pm_1205271_bio_fasta_extract.pl", is intended to provide some techniques you may find useful.

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

my ($fasta_in, $fasta_out) = qw{
    pm_1205271_bio_fasta_extract_dummy_in.fasta
    pm_1205271_bio_fasta_extract_dummy_out.fasta
};

my $re = qr[(.{10}GG)\z];

{
    open my $in_fh, '<', $fasta_in;
    open my $out_fh, '>', $fasta_out;

    local $/ = "\n>";

    while (<$in_fh>) {
        chomp;
        substr $_, 0, 1, '' if $. == 1;
        my ($head, $seq) = split /\n/;
        next unless $seq =~ $re;
        next if index($seq, $1) < length($seq) - length($1);
        print $out_fh ">$head\n$seq\n";
    }
}
[download]

Here's a sample run with some dummy data:

$ cat pm_1205271_bio_fasta_extract_dummy_in.fasta
>1: not wanted - too short
xxxGG
>2: wanted - exact match
xxxxxxxxxxGG
>3: not wanted - not unique
xxxxxxxxxxGGxxxxxxxxxxGG
>4: wanted - unique (but only just)
xxxxxxxxxGGxxxxxxxxxxGG
>5: not wanted - no GG at end
xxxxxxxxxxGGx
>6: not wanted - no match at all
xxxxxxxxxxAA
>7: wanted - match unique
yyyyyyyyyyGGxxxxxxxxxxGG
$ pm_1205271_bio_fasta_extract.pl
$ cat pm_1205271_bio_fasta_extract_dummy_out.fasta
>2: wanted - exact match
xxxxxxxxxxGG
>4: wanted - unique (but only just)
xxxxxxxxxGGxxxxxxxxxxGG
>7: wanted - match unique
yyyyyyyyyyGGxxxxxxxxxxGG
[download]

See also: autodie; perlre (you may want to follow the links at the start of that to introductory material); perlvar (for "$/" and "$."); and, Perl functions A-Z (for any functions you're unfamiliar with).

I'd also recommend you use lexical filehandles (see open); using package variables can result in bugs which are hard to track down (especially when you start writing code that's more substantial than a short assignment script).

— Ken

In reply to Re: unique sequences by kcott
in thread unique sequences by Anonymous Monk

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.