in reply to Re: Perl regular expression for amino acid sequence
in thread Perl regular expression for amino acid sequence

Ah, I should have explained more. I need to know what the matched pattern was, and I need to know whether it's repeated, which is why two regexes won't work, I'm afraid. Here's my full code, including your addition:

while ($seq{$k} =~ /([QGYN]{3,6})/g) {
    print "\n$k";
    print $1." begins at position ", (pos($seq{$k})-length($s)), "\n";
}

Thanks
Sam

Re^3: Perl regular expression for amino acid sequence
by TedYoung (Deacon) on Dec 01, 2004 at 20:12 UTC

    Hi,

    Actually, I think you could use two regexes here:

    while ($seq{$k} =~ /([QGYN]{3,6})/g) {
        my $seq = $1;
        next if $seq =~ /(.)\1\1/;
        print "\n$k";
        print "$seq begins at position ", (pos($seq{$k})-length($s)), "\n";
    }

    If this works for you, we could even optimize and consolidate this code a bit. I don't know where $s comes from, but I assume its length doesn't change inside the loop.

    my $length = length $s;   # Pull this out of the loop for efficiency.
    my $sequence = $seq{$k};
    while ($sequence =~ /([QGYN]{3,6})/g) {
        my $seq = $1;
        my $pos = $-[0] - $length;   # @- holds the start offsets of the last match
        next if $seq =~ /(.)\1\1/;
        print "\n$k $seq begins at position $pos\n";
    }

    update: that was supposed to be print, not printf

    Note that this is untested...
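
    For anyone unsure how @- relates to pos(), here is a minimal sketch. The input string and the variable names are made up for illustration and are not taken from the original poster's data.

```perl
use strict;
use warnings;

# Hypothetical input: three allowed characters starting at offset 2.
my $text = 'xxQGYxx';

my ($start, $end, $pos);
if ($text =~ /([QGYN]{3,6})/g) {
    $start = $-[0];       # offset where the whole match begins
    $end   = $+[0];       # offset just past the end of the match
    $pos   = pos($text);  # where the next /g search on $text will resume
    print "$1 starts at $start, ends before $end, pos is $pos\n";
}
```

    In other words, $-[0] is already the start offset of the match, while pos() is the offset just past its end.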

    Ted Young

    ($$<<$$=>$$<=>$$<=$$>>$$) always returns 1. :-)
      The problem with using two regexes is: if it matches and then gets rejected, your pos counter is still incremented. Say you match a string of 6 chars, QGYNNN. What you want from that is QGYNN (right, OP?), but what happens is that all six characters get tossed.
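
      To see this concretely, here is a small sketch of the two-regex loop run on a made-up string containing QGYNNN. The input and the @hits array are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical input: a single run of six allowed characters.
my $sequence = 'xxQGYNNNxx';
my @hits;

while ($sequence =~ /([QGYN]{3,6})/g) {
    my $candidate = $1;
    # Rejecting here does not rewind pos(), so the next search
    # resumes after the whole six-character match.
    next if $candidate =~ /(.)\1\1/;
    push @hits, $candidate;
}

print scalar(@hits), " hit(s)\n";
```

      The loop reports no hits at all: QGYNNN is rejected as a unit, and the shorter candidates inside it are never tried.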

      That brings me to a flaw in my proposed solution: it will only give you QGYN from the above input. Needs some tweaking.


      Caution: Contents may have been coded under pressure.
Re^3: Perl regular expression for amino acid sequence
by dws (Chancellor) on Dec 01, 2004 at 21:13 UTC

    I need to know what the found pattern was and I need to know if it's repeated which is why two regexp wont work I'm afraid.

    Two regexes will work just fine. Use the first for coarse matching, and the second to filter out the unwanted matches.

    while ($seq{$k} =~ /([QGYN]{3,6})/g) {
        next if $1 =~ m/QQQ|GGG|YYY|NNN/;
        print "\n$k";
        print $1." begins at position ", (pos($seq{$k})-length($s)), "\n";
    }
    This has the benefit of being blindingly obvious about what you're doing.

    Oops: ikegami is correct. This is blindingly wrong.

      It has the benefit of being blindingly unobvious about what it's not doing. It doesn't detect 'GNN', 'NNGYGY' and 'NNGNN' given the input 'xxxxxxxGNNNxxxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx'.
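
      A quick way to check this is to run the filtering loop against that input and collect what it accepts. This is only a sketch: it uses a plain scalar instead of $seq{$k}, and @found is an illustrative name.

```perl
use strict;
use warnings;

my $input = 'xxxxxxxGNNNxxxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx';
my @found;

while ($input =~ /([QGYN]{3,6})/g) {
    my $run = $1;                          # copy before running another regex
    next if $run =~ m/QQQ|GGG|YYY|NNN/;    # reject any run containing a triple
    push @found, $run;
}

print "accepted: @found\n";
```

      Only GYGYNN gets through. GNNN, NNNGYG and NNNGNN are each matched greedily and then thrown away whole, so the valid pieces inside them are lost.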

Re^3: Perl regular expression for amino acid sequence
by !1 (Hermit) on Dec 01, 2004 at 20:37 UTC

    This solution is actually fairly wrong, since it first attempts to take from the front instead of trying to shorten the match. Of course, that assumes QGNNNG would be considered a series of two valid amino acid sequences, QGN and NNG.

    my $cur;
    while ($seq{$k} =~ /([QGYN]{3,6})/g) {
        $cur = $1;
        pos($seq{$k}) -= length($cur) - 1 and next if $cur =~ /(.)\1\1/;
        print "\n$k";
        print $cur." begins at position ", (pos($seq{$k})-length($s)), "\n";
    }
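
    A sketch of how this loop behaves on QGNNNG, assuming a plain scalar in place of $seq{$k} and collecting the accepted strings in a hypothetical @hits array instead of printing them:

```perl
use strict;
use warnings;

my $sequence = 'QGNNNG';
my @hits;
my $cur;

while ($sequence =~ /([QGYN]{3,6})/g) {
    $cur = $1;
    # On rejection, move pos() back to one character past the start
    # of the rejected match and try again from there.
    pos($sequence) -= length($cur) - 1 and next if $cur =~ /(.)\1\1/;
    push @hits, $cur;
}

print "accepted: @hits\n";
```

    Each rejection restarts the search one character further along, so the loop ends up reporting only NNG; the QGN at the front is never offered.
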
      The fix is something like:
      my $cur;
      while ($seq{$k} =~ /([QGYN]{3,6})/g) {
          $cur = $1;
          pos($seq{$k}) -= length($cur);
          $cur =~ s/(.)\1\1.*/$1$1/;
          if (length($cur) >= 3) {
              pos($seq{$k}) += length($cur);
          }
          else {
              ++pos($seq{$k});
              next;
          }
          print "\n$k";
          print $cur." begins at position ", (pos($seq{$k})-length($s)), "\n";
      }
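
      For trying this out, the same loop can be wrapped in a small sub that returns what it finds. This is a sketch under assumptions: find_runs is a made-up name, the input is a plain scalar rather than $seq{$k}, and the start offset is taken as pos() minus the length of the run itself (0-based), since $s is not shown in the thread.

```perl
use strict;
use warnings;

# find_runs is a hypothetical helper; it returns [start, run] pairs.
sub find_runs {
    my ($sequence) = @_;
    my @runs;
    while ($sequence =~ /([QGYN]{3,6})/g) {
        my $cur = $1;
        pos($sequence) -= length($cur);      # rewind to the start of the match
        $cur =~ s/(.)\1\1.*/$1$1/;           # cut the run off inside the first triple
        if (length($cur) >= 3) {
            pos($sequence) += length($cur);  # resume right after the shortened run
        }
        else {
            ++pos($sequence);                # too short: retry one character later
            next;
        }
        # Assumption: start offset = pos() minus the run's own length.
        push @runs, [ pos($sequence) - length($cur), $cur ];
    }
    return @runs;
}

my $input  = 'xxxxxxxGNNNxxxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx';
my $report = join ', ', map { sprintf '%s at %d', $_->[1], $_->[0] } find_runs($input);
print "$report\n";
```

      On ikegami's test string this should report GNN, NNGYGY, GYGYNN and NNGNN, at offsets 7, 19, 32 and 47.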

      Caution: Contents may have been coded under pressure.