in reply to match longest sequence in an "or" RE

... match the longest "or" possible ...

The Perl regex engine matches the leftmost longest substring for a given pattern; leftmost trumps longest. If you want to match the longest substring that occurs anywhere in the string (which, IIRC, is the POSIX definition of matching), you have to jump through a hoop or two:

>perl -wMstrict -le "my $s = 'x s m e f s f pl s f x'; ;; $s =~ qr/^.+(s m e f|s f pl|s f).+$/; print qq{'$1'}; print qq{ -------}; ;; my $alt = qr{ s[ ]m[ ]e[ ]f | s[ ]f[ ]pl | s[ ]x }xms; for my $s ( 'x s m e f s f pl s f x', 'x s f s f pl s m e f x', 'x s f s m e f s f pl x', ) { my ($longest) = reverse sort { length($a) <=> length($b) } $s =~ m{ (?<= .) $alt (?= .) }xmsg ; print qq{'$longest'}; } " 's f' ------- 's m e f' 's m e f' 's m e f'

I think there is another, possibly more succinct, solution using the 5.10+ Special Backtracking Control Verbs, but I can't remember it ATM.

Update: Well, this doesn't use backtracking control verbs quite the way I imagined, but at least it gets rid of the list-making and sorting:

>perl -wMstrict -le "my $alt = qr{ s[ ]m[ ]e[ ]f | s[ ]f[ ]pl | s[ ]f }xms; my $rex = qr{ (?<= .) $alt (?= .) }xms; ;; for my $s ( 'x s m e f s f pl s f x', 'x s f s f pl s m e f x', 'x s f s m e f s f pl x', 'x s f s f x', 'x s s x', 's f', 's m e f', ) { local our $longest; local our $offset; use re 'eval'; $s =~ m{ ($rex) (?{ ($longest, $offset) = ($1, $-[1]) if ! defined($longest) || length($1) > length($longest) }) (*SKIP)(*FAIL) }xmsg; print qq{'$s' -> '$longest' at $offset} if defined $longest; } " 'x s m e f s f pl s f x' -> 's m e f' at 3 'x s f s f pl s m e f x' -> 's m e f' at 16 'x s f s m e f s f pl x' -> 's m e f' at 8 'x s f s f x' -> 's f' at 3

Note that in the string  'x  s f  s f  x' the left-most occurrence of 's f' is found. This can be changed to the right-most by changing the comparison operator in  length($1) > length($longest) to >= instead.

Replies are listed 'Best First'.
Re^2: match longest sequence in an "or" RE
by Corion (Patriarch) on Sep 24, 2011 at 19:41 UTC

    I think one of the problems is the greedy .+ at the start of the regex. It would be better to leave it off or make it nongreedy. Then, just sorting by length decreasing makes the thing work again:

    >perl -wMstrict -lE "say /^.+?(s m e f|s f pl|s f).+$/ for '-s m e f s + f pl s f-';" s m e f

      It's necessary to leave off the first  .+ entirely and sort all the captured alternations by length decreasing (or some similar approach). Just making the first  .+ non-greedy is not sufficient to find the longest match anywhere in the string:

      >perl -wMstrict -lE "say q{'}, /^.+?(s m e f|s f pl|s f).+$/, q{' at offset }, $-[1] for '-s f s f pl s m e f-'; " 's f' at offset 1
Re^2: match longest sequence in an "or" RE
by JavaFan (Canon) on Sep 24, 2011 at 22:04 UTC
    There is leftmost, and there is leftmost. Perl regular expressions do both. POSIX regular expressions do just one of them.

    Oh, which leftmosts are there? Well, there's the leftmost starting point in the target string, and there's the leftmost choice in alternation.

    POSIX regular expressions find the leftmost match in the target string, but from all possible matches from that starting point, it will return the longest.