richardwfrancis has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I've often found excellent solutions on this site so I'm hoping I can get some help with this one.

I have a pattern and a string. I'd like to search the string for the pattern and where it matches return the string that matches and the start and end position in the string of the match. If there are multiple matches I'd like to return this data for all the matches.

I think I'm on to something with the following but I'm not quite there. Any suggestions would be very much appreciated.

#!/usr/bin/perl use strict; my $string = "CATINTHEHATWITHABAT"; my $regex = '\wAT'; my @matches = (); foreach my $match ($string =~ /($regex)/gi){ my $length = length($&); my $pos = length($`); my $start = $pos + 1; my $end = $pos + $length; my $hitpos = "$start-$end"; push @matches, "$match found at $hitpos "; } print "$_\n" foreach @matches;
Cheers, Richard
  • Comment on Match, Capture and get position of multiple patterns in the same string
  • Download Code

Replies are listed 'Best First'.
Re: Match, Capture and get position of multiple patterns in the same string
by moritz (Cardinal) on Nov 12, 2009 at 15:16 UTC
    I highly recommend reading perlvar about all those regex match variables, it's full of "hidden" gems.
    #!/usr/bin/perl use strict; use warnings; use 5.010; my $string = "CATINTHEHATWITHABAT"; my $regex = qr{\wAT}i; my @matches = (); while ($string =~ /($regex)/pg){ my $start = $-[0]; my $end = $+[0]; my $hitpos = "$start-$end"; say "${^MATCH} found at $hitpos"; }
    Perl 6 - links to (nearly) everything that is Perl 6.
Re: Match, Capture and get position of multiple patterns in the same string
by Fletch (Bishop) on Nov 12, 2009 at 15:14 UTC

    Probably @- and @+ (covered in perlvar) will get you what you want without your having to resort to computing them.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

Re: Match, Capture and get position of multiple patterns in the same string
by kennethk (Abbot) on Nov 12, 2009 at 15:13 UTC
    So very close. What you really mean to do is:

    #!/usr/bin/perl use strict; my $string = "CATINTHEHATWITHABAT"; my $regex = '\wAT'; my @matches = (); while ($string =~ /($regex)/gi){ my $match = $1; my $length = length($&); my $pos = length($`); my $start = $pos + 1; my $end = $pos + $length; my $hitpos = "$start-$end"; push @matches, "$match found at $hitpos "; } print "$_\n" foreach @matches;

    The difference is that a foreach loop builds the list before you start, whereas the while loop re-executes the expression each time. This means that you are clobbering $& and friends at the start of your foreach loop, but using a while loop means the values are fresh.

      Firstly, thanks to EVERYONE who has commented here. It's been a great help. I posted this before I went to bed and when I get to work this morning, you guys had solved all my problems! Wish it could be the same every day!

      I'll certainly check out the perlvar link as Fletch and moritz suggest. I always like a good gem!

      Kennethk, you know I tried a while loop cos I thought it should work but it kept hanging. I've since realised that this was down to doing something else in the loop that I didn't tell you guys about as it's a bit silly ;) . Basically I have the need to substitute the match within the string with a lowercase version of itself ie once the matches, positions etc are found the string will then go:

      From this: CATINTHEHATWITHABAT To this: catINTHEhatWITHAbat

      I was doing this simply as follows (adapted from the code moritz provided - thanks!):

      while ($string =~ /($regex)/pg){ my $match = ${^MATCH}; my $start = $-[0]; my $end = $+[0]; my $hitpos = "$start-$end"; my $lcmatch = lc($match); $string =~ s/$match/$lcmatch/g; push @matches, "$match found at $hitpos "; }

      However the substitution line seems to cause it to hang and I can't get my head around why as it should just be operating on the current match of which there are only 3 in this instance

      Can anyone make any suggestions other than storing each match string in the loop and doing the substitution separately outside the loop.

      Again, thanks for the help

      Regards, Richard
        The regex match inside the while condition stores its current position in pos($string) (see pos for detail), which the substitution resets, so on the next iteration it starts from the beginning again.

        So if you insist on doing the matching and substitution in two different steps, you have to manually set pos($string) after the substitution:

        use strict; use warnings; use 5.010; my $string = "CATINTHEHATWITHABAT"; my $regex = qr{\wAT}i; while ($string =~ m/($regex)/g){ my $match = $1; my $start = $-[0]; my $end = $+[0]; my $hitpos = "$start-$end"; my $lcmatch = lc($match); $string =~ s/$match/$lcmatch/g; # in the next iteration start over where we left off pos($string) = $end; say "$match found at $hitpos "; } say "string: $string";

        Since you put parenthesis around the regex, ${^MATCH} can be replaced by the shorter $1, and there's no need for the /p modifier.

        Perl 6 - links to (nearly) everything that is Perl 6.
        If you want to substitute at the same time as the matching, the following may work for you:

        #!/usr/bin/perl use strict; my $string = "CATINTHEHATWITHABAT"; my $regex = '\wAT'; my @matches = (); while ($string =~ s/($regex)/lc($1)/e){ my $match = $1; my $length = length($&); my $pos = length($`); my $start = $pos + 1; my $end = $pos + $length; my $hitpos = "$start-$end"; push @matches, "$match found at $hitpos "; } print "$_\n" foreach @matches; print "$string\n";

        I've used the e modifier (see perlretut) to evaluate the lower-case transliteration of the matched string.

        Caveat: This becomes an infinite loop if you use the i modifier, since you will continuously overwrite the first occurrence of 'cat'. If need case insensitivity, perhaps you'd want my $regex = '[A-Z][aA][tT]|[a-z][aA]T|[a-z]At'; or equivalent for your real case.

Re: Match, Capture and get position of multiple patterns in the same string
by JavaFan (Canon) on Nov 12, 2009 at 16:25 UTC
    Note that all solutions presented sofar will only return non-overlapping greedy matches, not all matches as the requirement is. Say for instance, you're searching for fofo in fofofo. I count two matches, but anything using while ("fofofo" =~ /fofo/g) will only find 1. And if the pattern is (?:fo)+ there are 6 potential matches (3 from position 0, 2 from position 2, 1 from position 4).

    Perhaps you are only interested in non-overlapping greedy matches, but that wasn't clear to me.

    $_ = "fofofo"; /((?:fo)+)(?{ say "[$-[1], $+[1], $1]" })(*FAIL)/; __END__ [0, 6, fofofo] [0, 4, fofo] [0, 2, fo] [2, 6, fofo] [2, 4, fo] [4, 6, fo]
Re: Match, Capture and get position of multiple patterns in the same string
by 7stud (Deacon) on Nov 12, 2009 at 15:50 UTC

    How about this

    use strict; use warnings; use 5.010; my $string = "CATATHAT"; my $regex = '\wAT'; say 0..9; #print ruler say $string, "\n"; #print original string while ($string =~ /$regex/gi) { my $match_len = length($&); my $start = length($`); my $end = $start + $match_len - 1; say "match: $& ", "start: $start ", "end: $end"; } --output:-- 0123456789 CATATHAT match: CAT start: 0 end: 2 match: HAT start: 5 end: 7
      Sorry, I refreshed the page after working on a solution, and it said 0 responses still.
Re: Match, Capture and get position of multiple patterns in the same string
by Marshall (Canon) on Nov 12, 2009 at 16:03 UTC
    I was a bit confused about the requirements, but I think this code does it.This code does something. I'm not sure if it does what you want.

    #!/usr/bin/perl -w use strict; my @strings = ("CATINTHEHATWITHABAT", "WERWERWAT134wAt"); my $regex = '\wAT'; foreach my $string (@strings) { while ( $string =~ m/($regex)/gi) { printf "%s found at pos %2d in %s\n", $1, pos($string)-length($1)+1, $string; } } __END__ Prints: CAT found at pos 1 in CATINTHEHATWITHABAT HAT found at pos 9 in CATINTHEHATWITHABAT BAT found at pos 17 in CATINTHEHATWITHABAT WAT found at pos 7 in WERWERWAT134wAt wAt found at pos 13 in WERWERWAT134wAt
    An update:
    #!/usr/bin/perl -w use strict; my @strings = ("CATINTHEHATWITHABAT", "WERWERWAT134wAtThatat", "WERWERWAT134wAtThattat"); my $regex = '\wAT'; foreach my $string (@strings) { while ( $string =~ m/($regex)/gi) { printf "%s found at pos %2d-%-2d in %s\n", $1, pos($string)-length($1)+1, pos($string), $string; } } __END__ Prints: CAT found at pos 1-3 in CATINTHEHATWITHABAT HAT found at pos 9-11 in CATINTHEHATWITHABAT BAT found at pos 17-19 in CATINTHEHATWITHABAT WAT found at pos 7-9 in WERWERWAT134wAtThatat wAt found at pos 13-15 in WERWERWAT134wAtThatat hat found at pos 17-19 in WERWERWAT134wAtThatat WAT found at pos 7-9 in WERWERWAT134wAtThattat wAt found at pos 13-15 in WERWERWAT134wAtThattat hat found at pos 17-19 in WERWERWAT134wAtThattat tat found at pos 20-22 in WERWERWAT134wAtThattat Note: ranges do not "overlap", see "...Thattat"
Re: Match, Capture and get position of multiple patterns in the same string
by johngg (Canon) on Nov 13, 2009 at 00:00 UTC

    Using a look-ahead will allow for overlapping patterns. However, @+ contains the same values as corresponding elements of @- when look-aheads are used so sums are necessary.

    $ perl -e ' > $str = q{CATINTHEHATWITHABATATINAAAT}; > $rex = qr{(?=([^A]A+T))}; > printf > qq{Found %s, length %d at offset %d to %d\n}, > $1, > length $1, > $-[ 0 ], > $-[ 0 ] + length( $1 ) - 1 > while $str =~ m{$rex}g;' Found CAT, length 3 at offset 0 to 2 Found HAT, length 3 at offset 8 to 10 Found BAT, length 3 at offset 16 to 18 Found TAT, length 3 at offset 18 to 20 Found NAAAT, length 5 at offset 22 to 26 $

    I hope this is of interest.

    Cheers,

    JohnGG