luthor has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

(first post at PerlMonks)

I have a regular expression that finds sequence patterns in an alphanumeric string ($seq), and I would like to know whether it is possible (and if so, how) to track the positions of the pattern matches, counting from zero at the first character of the string.

For example, if

$seq = "HHHHHTHHHHH55HHHH5H"

my REGEXP will find any number of 5's surrounded by any number of H's. In this case it will find a match at position 12 and one at position 18. But how can I automate the task so that I get the positions? I.e. I need the program to count from the beginning and, every time it finds a match (typically there is more than one), write out its position, in the order found.

My code is:
#!/usr/bin/perl -w
# use strict;
my $count = ();
my $seq = " TTHTT SHHHHHHHHH55HHHHTT HHTHHHH5HHHHH ";
while ($seq =~/H{1,}5{1,}H{1,}/g) {
    $count++;
}
print "Sequence: $seq\n";
print "Count: $count\n";
exit;

Thank you for reading and helping!

JP

Replies are listed 'Best First'.
Re: Pattern Matching with REGEXP: Is the match position trackable?
by belg4mit (Prior) on Sep 18, 2002 at 01:01 UTC
    while( /(?<=H)5+(?=H)/g ){ print pos; }

    I recommend perlre. Also, using a range with a lower limit of 1 and no upper limit is functionally equivalent to using + but more expensive.

    UPDATE: Note the use of a look-ahead for the second H, this is to allow the trailing H of one 5 to serve as the leading H of another 5.
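
A minimal sketch of why the look-around matters, assuming a made-up test string "H5H5H" (not from the original question) in which two 5's share the middle H. The variable names are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: two 5's that share the middle H.
my $seq = "H5H5H";

# Consuming pattern: the middle H is used up by the first match,
# so the second 5 has no leading H left to match against.
my $consuming = () = $seq =~ /H+5+H+/g;

# Look-around pattern: the H's are only tested, never consumed,
# so the middle H can serve both 5's.
my $lookaround = () = $seq =~ /(?<=H)5+(?=H)/g;

print "consuming:  $consuming\n";   # 1
print "lookaround: $lookaround\n";  # 2
```

Note that in the one-liner above, pos reports the offset just past the matched 5's, not the offset where they start.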

    --
    perl -pew "s/\b;([mnst])/'$1/g"

Re: Pattern Matching with REGEXP: Is the match position trackable?
by jsprat (Curate) on Sep 18, 2002 at 01:25 UTC
    belg4mit's regex (++) is much cleaner than the one I wrote, so I won't post mine ;)

    However, since you wanted the position of the _first_ 5, I'd substitute @- (also known as @LAST_MATCH_START) for pos.

    # following line stolen shamelessly from belg4mit's post
    while( /(?<=H)5+(?=H)/g ){ print $-[0]; }

    Oh yeah, FYI @LAST_MATCH_START starts counting at zero, not one.
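
For illustration, here is a small self-contained sketch that collects both ends of each match on the example string from the question. It assumes Perl 5.6 or later (for @- and @+); the array name @hits is just a placeholder.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $seq = "HHHHHTHHHHH55HHHH5H";
my @hits;

while ($seq =~ /(?<=H)5+(?=H)/g) {
    # $-[0] is the zero-based offset where the whole match starts;
    # $+[0] is the offset just past its end (equal to pos($seq) here).
    push @hits, [ $-[0], $+[0] ];
}

printf "start %d, end %d\n", @$_ for @hits;
# start 11, end 13
# start 17, end 18
```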

      One could also capture the 5+ and subtract its length from pos, especially if worried about compatibility, since @- is rather new (5.6) IIRC.
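
A rough sketch of that capture-and-subtract approach, using the example string from the question. Because the look-ahead does not consume the trailing H, pos($seq) sits right after the captured 5's, so subtracting length($1) gives the offset of the first 5. The name @starts is illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $seq = "HHHHHTHHHHH55HHHH5H";
my @starts;

while ($seq =~ /(?<=H)(5+)(?=H)/g) {
    # pos($seq) is the offset just past the run of 5's;
    # backing up by the run's length lands on its first character.
    push @starts, pos($seq) - length($1);
}

print "@starts\n";  # 11 17
```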

      --
      perl -pew "s/\b;([mnst])/'$1/g"

        First of all, thank you all very much for replying to my problem. I will start going through all of your suggestions/fixes and come back later today :)
Re: Pattern Matching with REGEXP: Is the match position trackable?
by sauoq (Abbot) on Sep 18, 2002 at 01:06 UTC

    If you want to know where the match began you can use pos($seq) - length($&) inside your while loop.

    I also suggest using /H+5+H+/ instead of what you have. The + quantifier means "1 or more" but is easier to read (and type) than {1,} is.

    Update: The anonymous reply below does well to point out the performance hit taken by using the $& special variable. A better approach would be to capture the whole regex with parens and then use pos($seq) - length($1) to obtain the position where the match began.
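
As a sketch of the capturing approach, here it is applied to the example string from the question. The extra inner group around the leading H+ is an addition for illustration, so that the offset of the first 5 can be derived as well; the variable names are placeholders.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $seq = "HHHHHTHHHHH55HHHH5H";
my @found;

# $1 is the whole match, $2 is just the leading run of H's.
while ($seq =~ /((H+)5+H+)/g) {
    my $match_start = pos($seq) - length($1);      # where the H run begins
    my $first_five  = $match_start + length($2);   # where the 5's begin
    push @found, [ $match_start, $first_five ];
}

print "match starts at $_->[0], first 5 at $_->[1]\n" for @found;
# match starts at 6, first 5 at 11
```

Only one match is reported for this string: the H's at offsets 13 to 16 are consumed by the first match, so the 5 at offset 17 no longer has a leading H available.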

    -sauoq
    "My two cents aren't worth a dime.";
    
      This does work, but (from perldoc perlvar):
      $&   The string matched by the last successful pattern match (not
           counting any matches hidden within a BLOCK or eval() enclosed by
           the current BLOCK). (Mnemonic: like & in some editors.) This
           variable is read-only and dynamically scoped to the current
           BLOCK.
      
           The use of this variable anywhere in a program imposes a
           considerable performance penalty on all regular expression
           matches. See the BUGS manpage.
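
      One way to sidestep the penalty without adding capture parens, assuming Perl 5.10 or later (newer than this thread), is the /p modifier together with ${^MATCH}. A minimal sketch on the example string from the question, with an illustrative array name:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $seq = "HHHHHTHHHHH55HHHH5H";
my @starts;

# /p asks Perl to keep the matched text for this regex only;
# ${^MATCH} then holds what $& would have held.
while ($seq =~ /H+5+H+/gp) {
    push @starts, pos($seq) - length(${^MATCH});
}

print "@starts\n";  # 6
```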