in reply to Perl regular expression for amino acid sequence

/[QGYN]{3,6}/ && !/(.)(?=\1\1)/

Or something like that. Instead of making it one regex, what's wrong with making it two regexes?
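
For what it's worth, here is a minimal sketch of how that pair of tests behaves on a few candidate strings. The helper name looks_valid and the sample strings are made up for illustration; the two patterns are the ones above. Note that the second pattern looks at the whole string in $_, so it is meant for a candidate fragment rather than a full sequence.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the argument contains a run of 3 to 6
# residues from [QGYN] and no character occurs three times in a row.
sub looks_valid {
    local $_ = shift;
    return ( /[QGYN]{3,6}/ && !/(.)(?=\1\1)/ ) ? 1 : 0;
}

my @candidates = qw(QGYN QGYNNN GNG QQ);    # made-up test strings
my %result = map { $_ => looks_valid($_) } @candidates;

printf "%-7s %s\n", $_, $result{$_} ? 'accepted' : 'rejected'
    for @candidates;
```

QGYNNN is rejected because of the NNN, and QQ because it is shorter than three residues.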

Being right, does not endow the right to be rude; politeness costs nothing.
Being unknowing, is not the same as being stupid.
Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Re^2: Perl regular expression for amino acid sequence
by seaver (Pilgrim) on Dec 01, 2004 at 20:01 UTC
    Ah, I should have explained more. I need to know what the found pattern was, and I need to know if it's repeated, which is why two regexes won't work, I'm afraid. Here's my full code, which includes your addition:

    while ($seq{$k} =~ /([QGYN]{3,6})/g) {
        print "\n$k";
        print $1." begins at position ", (pos($seq{$k})-length($s)) , "\n";
    }

    Thanks
    Sam

      Hi,

      Actually, I think you could use two regexes here:

      while ($seq{$k} =~ /([QGYN]{3,6})/g) {
          my $seq = $1;
          next if $seq =~ /(.)\1\1/;
          print "\n$k";
          print "$seq begins at position ", (pos($seq{$k})-length($s)) , "\n";
      }

      If this works for you, we could even optimize and consolidate this code a bit. I don't know where $s comes from, but I assume its length doesn't change inside the loop.

      my $length = length $s;    # Pull this out of the loop for efficiency.
      my $sequence = $seq{$k};
      while ($sequence =~ /([QGYN]{3,6})/g) {
          my $seq = $1;
          my $pos = $-[0] - $length;    # @- holds the start offsets of the last match
          next if $seq =~ /(.)\1\1/;
          print "\n$k $seq begins at position $pos\n";
      }

      update: that was supposed to be print, not printf

      Note that this is untested...
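
In case the @- business is unfamiliar, here is a small sketch of what $-[0] and pos() report after each match. The input string is made up for illustration, and this is only a demonstration of the two values, not the OP's real data.

```perl
use strict;
use warnings;

my $sequence = 'xxQGYNxxxGNQxx';    # made-up input
my @hits;

while ($sequence =~ /([QGYN]{3,6})/g) {
    # $-[0] is the offset where the whole match starts;
    # pos() is where the next search resumes, i.e. just past the match.
    push @hits, [ $1, $-[0], pos($sequence) ];
}

print "$_->[0] starts at $_->[1], next search resumes at $_->[2]\n" for @hits;
```

One thing to watch: a later successful match (such as the triple-letter check) overwrites @-, so $-[0] has to be read before that second regex runs.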

      Ted Young

      ($$<<$$=>$$<=>$$<=$$>>$$) always returns 1. :-)
        The problem with using two regexes is: if it matches and then gets rejected, your pos counter is still incremented. Say you match a string of 6 chars, QGYNNN. What you want from that is QGYNN (right, OP?), but what happens is that all six characters get tossed.

        That brings me to a flaw in my proposed solution: it will only give you QGYN from the above input. Needs some tweaking.
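
To make the pos problem concrete, here is a small sketch of the match-then-reject loop run on a made-up string containing QGYNNN. The variable names are for illustration only.

```perl
use strict;
use warnings;

my $seq = 'xxQGYNNNxx';    # made-up input containing QGYNNN
my @kept;

while ($seq =~ /([QGYN]{3,6})/g) {
    my $hit = $1;
    # Rejecting here does not rewind pos($seq): the search resumes
    # after all six characters, so QGYNN is never tried.
    next if $hit =~ /(.)\1\1/;
    push @kept, $hit;
}

print scalar(@kept), " fragment(s) kept\n";
```

The greedy match grabs all of QGYNNN, the filter throws it out, and nothing shorter is ever considered.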


        Caution: Contents may have been coded under pressure.

      I need to know what the found pattern was, and I need to know if it's repeated, which is why two regexes won't work, I'm afraid.

      Two regexes will work just fine. Use the first to do the coarse filtering, and the second to do the fine filtering.

      while ($seq{$k} =~ /([QGYN]{3,6})/g) {
          next if $1 =~ m/QQQ|GGG|YYY|NNN/;
          print "\n$k";
          print $1." begins at position ", (pos($seq{$k})-length($s)) , "\n";
      }
      This has the benefit of being blindingly obvious about what you're doing.

      Oops: ikegami is correct. This is blindingly wrong.

        It has the benefit of being blindingly unobvious about what it's not doing. It doesn't detect 'GNN', 'NNGYGY' and 'NNGNN' given the input 'xxxxxxxGNNNxxxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx'.
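
A quick way to see this is to run the filter loop over that input and collect what survives. This is only a sketch: the hash lookup and the position printing are dropped, and the variable names are made up.

```perl
use strict;
use warnings;

my $input = 'xxxxxxxGNNNxxxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx';
my @kept;

while ($input =~ /([QGYN]{3,6})/g) {
    my $hit = $1;
    # Same filter as the parent node: drop any hit containing a triple.
    next if $hit =~ m/QQQ|GGG|YYY|NNN/;
    push @kept, $hit;
}

print "kept: @kept\n";
```

Only the run that happens to fit six characters before its triple gets through; the other three runs are consumed and discarded whole.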

      This solution is actually fairly wrong, since it first attempts to take from the front instead of trying to shorten the match. Of course, that assumes QGNNNG should be read as a series of two valid amino acid sequences, namely QGN and NNG.

      my $cur;
      while ($seq{$k} =~ /([QGYN]{3,6})/g) {
          $cur = $1;
          pos($seq{$k}) -= length($cur) - 1 and next if $cur =~ /(.)\1\1/;
          print "\n$k";
          print $cur." begins at position ", (pos($seq{$k})-length($s)) , "\n";
      }
        The fix is something like:
        my $cur;
        while ($seq{$k} =~ /([QGYN]{3,6})/g) {
            $cur = $1;
            pos($seq{$k}) -= length($cur);
            $cur =~ s/(.)\1\1.*/$1$1/;
            if (length($cur) >= 3) {
                pos($seq{$k}) += length($cur);
            } else {
                ++pos($seq{$k});
                next
            }
            print "\n$k";
            print $cur." begins at position ", (pos($seq{$k})-length($s)) , "\n";
        }
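
For anyone who wants to poke at the pos()-rewinding idea in isolation, here is a sketch that wraps the same steps in a sub and returns the hits instead of printing them. The sub name find_runs is hypothetical, and it uses $-[0] for the start offset instead of subtracting lengths. It assumes the goal is to shorten a match at its first triple and retry one character later when too little is left.

```perl
use strict;
use warnings;

# Hypothetical wrapper: returns [fragment, start offset] pairs.
sub find_runs {
    my ($seq) = @_;
    my @hits;
    while ($seq =~ /([QGYN]{3,6})/g) {
        my $cur   = $1;
        my $start = $-[0];                # read before s/// overwrites @-
        $cur =~ s/(.)\1\1.*/$1$1/;        # cut at the first triple, keep two of it
        if (length($cur) >= 3) {
            pos($seq) = $start + length($cur);    # resume after the kept part
            push @hits, [ $cur, $start ];
        }
        else {
            pos($seq) = $start + 1;       # too short: retry one character later
        }
    }
    return @hits;
}

my @found = find_runs(
    'xxxxxxxGNNNxxxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx');
print "$_->[0] begins at position $_->[1]\n" for @found;
```

On the test input from the other subthread, this picks up GNN, NNGYGY, GYGYNN and NNGNN, and QGYNNN comes out as QGYNN.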

        Caution: Contents may have been coded under pressure.