in reply to Perl regular expression for amino acid sequence

Here's one way, though it needs a slightly convoluted calculation to recover the original position:
# break up three character repeats, inserting spaces
while ($seq{$k} =~ s/([QGYN])\1\1/$1$1 $1$1/g) { }

while ($seq{$k} =~ m/([QGYN]{3,6})/g) {
    print "Match: $1 at ",
        pos($seq{$k}) - length($1)
            - 2*(substr($seq{$k}, 0, pos($seq{$k})) =~ tr/ / /),
        "\n";
}
If you already have spaces in your sequences, you'd have to use some other character.
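To illustrate that last point, here is a minimal sketch of the break-up step with the separator passed in. The helper name break_triples and the '-' separator are made up for this example and are not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): break up runs of three
# identical residues by inserting a caller-chosen separator.
sub break_triples {
    my ($seq, $sep) = @_;
    # Repeat until no triple is left, as the empty while loop above does.
    1 while $seq =~ s/([QGYN])\1\1/$1$1${sep}$1$1/g;
    return $seq;
}

print break_triples('xxGNNNxx', '-'), "\n";   # prints xxGNN-NNxx
```

Note that tr/// does not interpolate variables, so the separator would have to be written literally in the counting step (for example tr/-//) when adjusting the position.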

Updated: Changed 5 to 6. I thought the original had a "5", but it was just the tiny fonts on my monitor.

Re^2: Perl regular expression for amino acid sequence
by ikegami (Patriarch) on Dec 01, 2004 at 21:21 UTC
    >perl script.pl
    Match: GNN at 7
    Match: GNN at 7
    Match: GNN at 7
    Match: GNN at 7
    Match: GNN at 7
    Match: GNN at 7
    Match: GNN at 7
    Match: GNN at 7
    ...

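    It seems my Perl's tr/// clears pos for all strings, so the m//g loop restarts from the beginning each time and never terminates. Workaround: save pos before the tr/// and restore it afterwards: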

    use strict;
    use warnings;

    my %seq;
    my $k = 0;
    $seq{$k} = 'xxxxxxxGNNNxxxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx';

    # break up three character repeats, inserting spaces
    while ($seq{$k} =~ s/([QGYN])\1\1/$1$1 $1$1/g) { }

    while ($seq{$k} =~ m/([QGYN]{3,5})/g) {
        my $saved_pos = pos($seq{$k});
        printf("Match: %s at %d\n",
            $1,
            pos($seq{$k}) - length($1)
                - 2*(substr($seq{$k}, 0, pos($seq{$k})) =~ tr/ / /),
        );
        pos($seq{$k}) = $saved_pos;
    }
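Another way to sidestep the problem, offered as a sketch rather than a drop-in replacement: count the spaces on a copy of the prefix, so that tr/// never operates on the original string and its pos is left alone. The helper name find_runs and the short test sequence are made up for illustration; the {3,6} bound follows the updated regex in the parent post.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns a list of
# [match, original_position] pairs for one sequence.
sub find_runs {
    my ($seq) = @_;

    # Break up three-character repeats by inserting spaces, as in the parent post.
    1 while $seq =~ s/([QGYN])\1\1/$1$1 $1$1/g;

    my @hits;
    while ($seq =~ m/([QGYN]{3,6})/g) {
        my $match = $1;
        my $end   = pos($seq);

        # Copy the prefix first; tr/ // with an empty replacement list only
        # counts, and it runs on the copy, so pos($seq) is not disturbed.
        my $prefix = substr($seq, 0, $end);
        my $spaces = ($prefix =~ tr/ //);

        # Each inserted break added two characters (a space and a duplicate).
        push @hits, [$match, $end - length($match) - 2 * $spaces];
    }
    return @hits;
}

printf "Match: %s at %d\n", @$_ for find_runs('xxxxxxxGNNNxxxxxxx');
# prints: Match: GNN at 7
```

Because the count is taken on a separate scalar, there is no need to save and restore pos around it.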

    Finally, a solution that works!