Re^2: Perl regular expression for amino acid sequence

Re^3: Perl regular expression for amino acid sequence by TedYoung (Deacon) on Dec 01, 2004 at 20:12 UTC
Hi,

Actually, I think you could use two regexes here:

```
while ($seq{$k} =~ /([QGYN]{3,6})/g) {
    my $seq = $1;
    next if $seq =~ /(.)\1\1/;
    print "\n$k";
    print "$seq begins at position ", (pos($seq{$k}) - length($s)), "\n";
}
```

If this works for you, we could even optimize and consolidate this code a bit. I don't know where $s comes from, but I assume its length isn't changing.

```
my $length   = length $s;    # Pull this out of the loop for efficiency
my $sequence = $seq{$k};
while ($sequence =~ /([QGYN]{3,6})/g) {
    my $seq = $1;
    my $pos = $-[0] - $length;    # @- holds the start offsets of the last match
    next if $seq =~ /(.)\1\1/;
    print "\n$k $seq begins at position $pos\n";
}
```

Update: that was supposed to be print, not printf.

Note that this is untested...

Ted Young

`($$<<$$=>$$<=>$$<=$$>>$$) always returns 1. :-)`
Re^4: Perl regular expression for amino acid sequence by Roy Johnson (Monsignor) on Dec 01, 2004 at 20:17 UTC
The problem with using two regexes is that if the first one matches and the second then rejects the match, pos has still advanced past the whole candidate. Say you match a string of 6 chars, QGYNNN. What you want from that is QGYNN (right, OP?), but what happens is that all six characters get tossed.

That brings me to a flaw in my proposed solution: it will only give you QGYN from the above input. Needs some tweaking.

Caution: Contents may have been coded under pressure.
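To make the pos behaviour concrete, here is a minimal sketch of the two-regex filter run on the QGYNNN example. The helper name `two_regex_hits` and the bare test string are assumptions for illustration; they are not from the original poster's code.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the candidates
# that survive the two-regex filter.
sub two_regex_hits {
    my ($str) = @_;
    my @kept;
    while ($str =~ /([QGYN]{3,6})/g) {
        my $hit = $1;
        # Rejecting the candidate does not rewind pos($str), so the next
        # search resumes after the whole rejected candidate.
        next if $hit =~ /(.)\1\1/;
        push @kept, $hit;
    }
    return @kept;
}

# QGYNNN is matched greedily as one 6-char candidate and rejected whole,
# so the shorter valid run QGYNN is never reported.
my @hits = two_regex_hits('QGYNNN');
print scalar(@hits), " hit(s)\n";
```

Any fix therefore has to either move pos back after a rejection or shorten the candidate instead of discarding it.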
Re^3: Perl regular expression for amino acid sequence by dws (Chancellor) on Dec 01, 2004 at 21:13 UTC
"I need to know what the found pattern was and I need to know if it's repeated which is why two regexp wont work I'm afraid."

Two regexes will work just fine. Use the first to do coarse filtering, and the second to do the fine filtering.

```
while ($seq{$k} =~ /([QGYN]{3,6})/g) {
    next if $1 =~ m/QQQ|GGG|YYY|NNN/;
    print "\n$k";
    print $1." begins at position ", (pos($seq{$k}) - length($s)), "\n";
}
```

This has the benefit of being blindingly obvious about what you're doing.

Oops: ikegami is correct. This is blindingly wrong.
Re^4: Perl regular expression for amino acid sequence by ikegami (Patriarch) on Dec 01, 2004 at 21:25 UTC
It has the benefit of being blindingly unobvious about what it's not doing. It doesn't detect 'GNN', 'NNGYGY' and 'NNGNN' given the input `'xxxxxxxGNNNxxxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx'`
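For comparison, here is a rough sketch of a plain scanning approach that does pick up those three runs. It assumes the rule is "leftmost, longest window of 3 to 6 residues from QGYN in which no residue occurs three times in a row, with no overlap between windows"; that reading, the helper name `find_runs`, and the 0-based offsets are my assumptions, not something the OP confirmed.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [substring, 0-based start offset] pairs.
sub find_runs {
    my ($str) = @_;
    my @hits;
    my $len   = length $str;
    my $start = 0;
    while ($start < $len) {
        my $end = $start;    # exclusive end of the candidate window
        while ($end < $len && $end - $start < 6) {
            my $c = substr($str, $end, 1);
            last unless $c =~ /[QGYN]/;
            # Stop before a residue that would complete a triple
            # lying entirely inside the current window.
            last if $end - $start >= 2
                 && substr($str, $end - 2, 2) eq ($c x 2);
            $end++;
        }
        if ($end - $start >= 3) {
            push @hits, [ substr($str, $start, $end - $start), $start ];
            $start = $end;    # resume after the accepted window
        }
        else {
            $start++;         # too short: retry one character later
        }
    }
    return @hits;
}

my $input = 'xxxxxxxGNNNxxxxxxxNNNGYGYxxxxxxxGYGYNNNxxxxxxxNNNGNNNxxxxxxx';
print "$_->[0] begins at offset $_->[1]\n" for find_runs($input);
# Reports the runs GNN, NNGYGY, GYGYNN and NNGNN, in that order.
```

It trades the regex for explicit bookkeeping, which makes it easy to see why a window ends where it does.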
Re^3: Perl regular expression for amino acid sequence by !1 (Hermit) on Dec 01, 2004 at 20:37 UTC
This solution is actually fairly wrong, since it first attempts to take from the front instead of trying to shorten the match. Of course, this assumes QGNNNG would be considered a series of two valid amino acid sequences, namely QGN and NNG.

```
my $cur;
while ($seq{$k} =~ /([QGYN]{3,6})/g) {
    $cur = $1;
    pos($seq{$k}) -= length($cur) - 1 and next if $cur =~ /(.)\1\1/;
    print "\n$k";
    print $cur." begins at position ", (pos($seq{$k}) - length($s)), "\n";
}
```
Re^4: Perl regular expression for amino acid sequence by Roy Johnson (Monsignor) on Dec 01, 2004 at 20:53 UTC
The fix is something like:

```
my $cur;
while ($seq{$k} =~ /([QGYN]{3,6})/g) {
    $cur = $1;
    pos($seq{$k}) -= length($cur);    # rewind to the start of the match
    $cur =~ s/(.)\1\1.*/$1$1/;        # cut the candidate just before a triple completes
    if (length($cur) >= 3) {
        pos($seq{$k}) += length($cur);
    }
    else {
        ++pos($seq{$k});
        next;
    }
    print "\n$k";
    print $cur." begins at position ", (pos($seq{$k}) - length($s)), "\n";
}
```

Caution: Contents may have been coded under pressure.