saintbrie has asked for the wisdom of the Perl Monks concerning the following question:

Perl returns the leftmost longest match when you use regular expressions. How do you get it to return the shortest match? For example:

    $text = "the dog and the bear are quick, but the quick fox is quicker";
    $re = "the\\b.*?quick\\b.*?fox";
    print "it matches \n" if ($text =~ /($re)/sg);
    print $1 . "\n";

I just want it to match "the quick fox". (The spaces could be any gobbledygook at all; I'm parsing HTML and can't use HTML::Parser.) What I get is "the dog and the bear are quick, but the quick fox".

Re: Getting the shortest match?
by Abigail-II (Bishop) on Jan 18, 2003 at 00:40 UTC
    You mean, of all possible matches, find the shortest? There's no such method in Perl. It would be a very costly operation; you'd basically need to backtrack through all possible matches and select the shortest.

    Abigail
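
The exhaustive search described in this reply can be sketched with Perl's own backtracking engine. This is only an illustration, not something posted in the thread: the sub name `shortest_of_all_matches` is made up, and the sketch assumes Perl 5.18 or later, where a code block inside a pattern can safely close over a lexical variable. It forces the engine to visit every possible match and remembers the shortest one, which is exactly why it is expensive.

```perl
use strict;
use warnings;
use re 'eval';    # harmless on recent Perls; older ones demand it when a
                  # pattern mixes interpolation with a (?{ ... }) block

# Hypothetical helper (not from the thread): try every way $re can match
# inside $text and keep the shortest matched substring.
sub shortest_of_all_matches {
    my ($text, $re) = @_;
    my $best;

    # The (?{ ... }) block runs every time the capture group completes.
    # (*FAIL) then rejects that match, so the engine backtracks into the
    # next alternative and, eventually, moves on to the next start position.
    $text =~ /($re)(?{ $best = $1 if !defined($best) || length($1) < length($best) })(*FAIL)/s;

    return $best;    # undef if $re never matches
}

my $text = "the dog and the bear are quick, but the quick fox is quicker";
my $found = shortest_of_all_matches($text, qr/the\b.*?quick\b.*?fox/s);
print defined $found ? "$found\n" : "no match\n";
```

For the text in the question this prints "the quick fox". Because every candidate match is visited, the running time grows quickly with the length of the text and the number of quantifiers in the pattern.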

Re: Getting the shortest match?
by runrig (Abbot) on Jan 18, 2003 at 01:09 UTC
    You might not be able to do this for all regexes, and like Abigail-II says, it's a costly operation, but in your particular case, you could do something like this:

    my $str = "abc def ghi abc 123 junk";
    my $short;
    while ($str =~ /abc.*?123/g) {
        $short = $& if ! defined $short or length($&) < length($short);
        pos($str) = $-[0]+1;
    }
    print "$short\n";
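
Applied to the text from the original question, the same loop might be wrapped up as below. This is a sketch rather than runrig's code: the sub name `shortest_lazy_match` is hypothetical, `@-` and `@+` stand in for `$&`, and a guard is added so a pattern that can match the empty string does not loop forever. It only compares the non-greedy match found from each start position, so it is not guaranteed to find the overall shortest match for every regex.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): restart the non-greedy match
# one character past each previous match start and keep the shortest hit.
sub shortest_lazy_match {
    my ($str, $re) = @_;
    my $short;

    while ($str =~ /$re/g) {
        my ($start, $end) = ($-[0], $+[0]);
        my $hit = substr($str, $start, $end - $start);
        $short = $hit if !defined $short || length($hit) < length($short);

        last if $start >= length $str;    # guard against empty matches at the end
        pos($str) = $start + 1;           # resume just after this match's start
    }
    return $short;                        # undef if $re never matches
}

my $text = "the dog and the bear are quick, but the quick fox is quicker";
my $found = shortest_lazy_match($text, qr/the\b.*?quick\b.*?fox/s);
print defined $found ? "$found\n" : "no match\n";
```

For this input the loop sees three candidates, starting at each "the", and prints "the quick fox".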
Re: Getting the shortest match?
by gjb (Vicar) on Jan 18, 2003 at 01:07 UTC

    As Abigail-II says, it would be expensive to get the shortest match, and the following code is expensive, but it does what you want for the examples.

    use strict;
    use warnings;

    my $re = qr/the\b.*?quick\b.*?fox/;
    while (<DATA>) {
        chomp($_);
        my $text = $_;
        my $match;
        while ($text =~ /($re)/) {
            $match = $1;
            $text = substr($match, 1);
        }
        print "it matches '$match'\n";
    }
    __DATA__
    the dog and the bear are quick, but the quick white fox is quicker
    the quick fox and the brown fox
    the quick raven and the quick fox
    the quick yellow fox and the quick fox
    Notice that it will return 'the quick yellow fox' in the last case since that's the first "shortest" match.

    Hope this helps, -gjb-

      It is possible that this will work if I tune my 'gobbledygook' filter.