seaver has asked for the wisdom of the Perl Monks concerning the following question:

Dear all,

I'm searching a whole list of sequences of letters for a specific pattern. My problem is that the pattern isn't very specific by nature:

    $hydphb='GAFIVL';
    $polar='DEHAW';
    $charged='DGYAH';
    $_ =~ /[$polar.$charged]([$hydphb]|[$polar]| H[$hydphb][$polar][$hydphb]W){18-24}[$polar.$charged]/;

So basically, I'm looking for a sequence bounded by [$polar.$charged] with 18-24 letters in it, of which only 4 can be $polar, and this motif must be in there too:

H[$hydphb][$polar][$hydphb]W.

The letters in the motif are included in the total number of letters, and the $polar in the motif must be included in the 4 $polar limit.

I was thinking about doing it in two passes:

    $_ =~ /[$polar.$charged] ([$hydphb]|[$polar]){18-24} [$polar.$charged]/;
    $_ =~ /H[$hydphb][$polar][$hydphb]W/;

If the sequence matches both times, and I count the number of polar to be 4 then I can return true.
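
A minimal sketch of that multi-pass idea, for what it's worth. It assumes several things the post leaves open: that the whole string is a single candidate (hence the \A and \z anchors), that "only 4" means at most 4 polar letters, and that the count covers the interior letters only, not the two boundary letters. The helper name is_candidate is made up for illustration. The boundary class is written [$polar$charged] with no dot (a dot inside a character class is a literal character), and the quantifier is written {18,24}, since Perl treats {18-24} as literal text rather than a range.

```perl
use strict;
use warnings;

my $hydphb  = 'GAFIVL';
my $polar   = 'DEHAW';
my $charged = 'DGYAH';

# Hypothetical helper: true if the whole string $seq is one valid candidate.
sub is_candidate {
    my ($seq) = @_;

    # Pass 1: boundary letter, 18 to 24 interior letters, boundary letter.
    # The interior is captured so the later passes only look at it.
    my ($inner) = $seq =~ /\A[$polar$charged]([$hydphb$polar]{18,24})[$polar$charged]\z/
        or return 0;

    # Pass 2: the motif must occur somewhere in the interior.
    return 0 unless $inner =~ /H[$hydphb][$polar][$hydphb]W/;

    # Pass 3: count the polar letters in the interior.
    # Assumption: the limit is "at most 4", motif letters included.
    my $n_polar = () = $inner =~ /[$polar]/g;
    return $n_polar <= 4 ? 1 : 0;
}

# Try it on a few made-up sequences.
for my $seq (qw(YVVVVVVVVVVHVEVWVVVY YVVVVVVVVVVHVEVWVVY)) {
    printf "%-22s %s\n", $seq, is_candidate($seq) ? 'ok' : 'rejected';
}
```

Because 'A' is in both $polar and $hydphb, this sketch counts every 'A' as polar; adjust the counting class if that is not what you want.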

What do you think?

Cheers
Sam

Replies are listed 'Best First'.
Re: Searching for character classes of various quantities AND a motif
by davido (Cardinal) on Oct 08, 2003 at 19:23 UTC
    Since the regexp engine doesn't really have an "and" feature that provides layering, your solution is decent. I do have one comment though. I'm not sure if you're aware, but the '.' (dot) inside of the character classes is going to be taken literally, as a character. In other words:
    [$polar.$charged]
    Means any single character of the following set:
    D E H A W . D G Y A H
    I put the dot in there as a character, not an operator.
    Inside a character class (and inside quoted strings in general) you don't use the dot as a concatenation operator. The strings are interpolated and concatenated all at once, and the dot is a literal character, in such contexts.
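
A quick way to see this for yourself, using the two strings from the question. The variable names $with_dot and $without_dot are just for this illustration.

```perl
use strict;
use warnings;

my $polar   = 'DEHAW';
my $charged = 'DGYAH';

# With the dot, the class interpolates to [DEHAW.DGYAH],
# so a string consisting of a literal '.' matches.
my $with_dot = '.' =~ /\A[$polar.$charged]\z/ ? 1 : 0;

# Without the dot, the class is [DEHAWDGYAH] and '.' no longer matches.
my $without_dot = '.' =~ /\A[$polar$charged]\z/ ? 1 : 0;

print "with dot: $with_dot, without dot: $without_dot\n";
# prints "with dot: 1, without dot: 0"
```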


    Dave


    "If I had my life to do over again, I'd be a plumber." -- Albert Einstein
Re: Searching for character classes of various quantities AND a motif
by BrowserUk (Patriarch) on Oct 08, 2003 at 21:06 UTC

    If I've understood the rules, I *think* this would do the job, but it is a complicated thing to generate test cases for. Do you have any?

    #! perl -slw
    use strict;

    my $polar   = 'DEHAW';  # Uniq E
    my $charged = 'DGYAH';  # Uniq Y
    my $hydphb  = 'GAFIVL'; # Uniq FIVL

    my $re_seq = qr[
        # Capture
        (
            # start char
            [$polar$charged]
            # Must contain motif
            (?= .* H[$hydphb][$polar][$hydphb]W .* )
            # Mustn't contain more than 4 polar chars
            (?! (?: .* [$polar] ){5} )
            # 18-24 polar|hydphb chars
            # Excludes the start and end chars (Adjust if wrong!)
            [$polar$hydphb]{18,24}
            #
            # end char
            [$polar$charged]
        )
    ]x;

    while( <DATA> ) {
        chomp;
        print "\nTesting '$_'";
        s[\s+#.*$][];
        print "matched: '$1'" while m[$re_seq]g;
    }

    __DATA__
    YVVVVVVVVVVHVEVWVVY # 1 too short
    YVVVVVVVVVVHVEVWVVVY # min match
    YVVVVVVVVVVHVEVWVVVVVVVVVY # max match
    YVVVVVVVVVVHVEVWVVVVVVVVVVY # 1 too long
    YVVVVVVVVVVHVEVWVVVVVVVVEY # max polar
    YVVVVVVVVVVHVEVWVVVVVVVEEY # too many polar
    YVVVVVVVVVVHVYVWVVVY # missing motif

    Results

    P:\test>junk

    Testing 'YVVVVVVVVVVHVEVWVVY # 1 too short'

    Testing 'YVVVVVVVVVVHVEVWVVVY # min match'
    matched: 'YVVVVVVVVVVHVEVWVVVY'

    Testing 'YVVVVVVVVVVHVEVWVVVVVVVVVY # max match'
    matched: 'YVVVVVVVVVVHVEVWVVVVVVVVVY'

    Testing 'YVVVVVVVVVVHVEVWVVVVVVVVVVY # 1 too long'

    Testing 'YVVVVVVVVVVHVEVWVVVVVVVVEY # max polar'
    matched: 'YVVVVVVVVVVHVEVWVVVVVVVVEY'

    Testing 'YVVVVVVVVVVHVEVWVVVVVVVEEY # too many polar'

    Testing 'YVVVVVVVVVVHVYVWVVVY # missing motif'
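
For anyone unfamiliar with the trick, the two lookaheads are what let a single pattern impose several conditions at once: each one is checked at the same position without consuming any characters. A toy sketch of the same idea on a much smaller alphabet; the pattern, the test strings and the helper name toy_match are made up for illustration and are not part of the regex above.

```perl
use strict;
use warnings;

# Toy pattern: 3 to 5 letters from [AB], which must contain at least one 'B'
# (positive lookahead) and fewer than 3 'A's (negative lookahead).
my $re = qr/\A (?= .* B ) (?! (?: .* A ){3} ) [AB]{3,5} \z/x;

# Hypothetical helper so the result is easy to print.
sub toy_match { return $_[0] =~ $re ? 1 : 0 }

for my $s (qw(AAB ABAB AAAB AAA)) {
    printf "%-5s %s\n", $s, toy_match($s) ? 'match' : 'no match';
}
```

Here AAB and ABAB match, AAAB fails the "fewer than 3 A" condition, and AAA fails the "contains a B" condition.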

    I make no predictions about efficiency:)


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail