ssc37 has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

After reading some documentation and browsing the internet, I didn't find a solution to my problem.
I have a group in a regex containing some words I'm looking for in a string (a kind of synonym list).
My problem is that, even though it is a synonym list, I need to know which of the alternatives that were found is the longest.

In this example, I would like the match to return "apple in the fridge" instead of "an apple".
perl -e '
    my $string = "an apple in the fridge";
    $string =~ /(apple in the fridge|an apple)/;
    print $1 . "\n";
'
an apple

Do you have a suggestion to help me with this kind of situation?

Thanks for your help,
Best regards,

Replies are listed 'Best First'.
Re: Regex Match : Doesn't return the longest match when there's a common word present in the group
by choroba (Cardinal) on Feb 01, 2015 at 17:27 UTC
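    Perl's quantifiers are greedy, but the engine first tries to match as early in the string as possible, and at a given position it takes the first alternative that succeeds rather than the longest one. "an apple" begins before "apple in the fridge", which is why it wins in this case.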
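
To illustrate the leftmost-match rule described above, here is a minimal sketch. The helper name `first_match` is made up for this example and is not part of the original post; it simply reports the captured text and its start offset for two orderings of the same alternatives.

```perl
use strict;
use warnings;

# Hypothetical helper: return the captured text and its start offset
# for the first match of $re in $string (empty list if no match).
sub first_match {
    my ($string, $re) = @_;
    return unless $string =~ $re;
    return ($1, $-[1]);    # $-[1] is the start offset of capture group 1
}

my $string = "an apple in the fridge";

# Same alternatives, two different orderings.
for my $re (qr/(apple in the fridge|an apple)/,
            qr/(an apple|apple in the fridge)/) {
    my ($text, $offset) = first_match($string, $re);
    print "'$text' at offset $offset\n";
}
```

Both orderings should report "an apple" at offset 0, because the engine settles on the earliest starting position before the order of the alternatives comes into play.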

    You can use the look-ahead assertion to search for overlapping strings in a loop:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $string = "an apple in the fridge";
    my $match  = q();
    length $1 > length $match and $match = $1
        while $string =~ /(?=(apple in the fridge|an apple))/g;
    print "$match\n";
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      That's exactly what I was looking for :) Thank you
Re: Regex Match : Doesn't return the longest match when there's a common word present in the group
by LanX (Saint) on Feb 01, 2015 at 17:41 UTC
    I think you should rather solve your logic problem.

    Why does one entry have an article "an" and not all of them?

    I.e., putting "an apple in the fridge" into your regex solves it; disallowing articles solves it too.

    IOW you're comparing apples with oranges (almost literally ;) ...

    edit
    Of course the alternatives should be sorted by length, but you seem to know this already...

    Cheers Rolf

    PS: Je suis Charlie!

Re: Regex Match : Doesn't return the longest match when there's a common word present in the group
by QM (Parson) on Feb 03, 2015 at 10:41 UTC
    First, order the alternatives by priority. If this is purely by length, put the longest one first.
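
As a rough sketch of the longest-first ordering, the snippet below builds the alternation from a word list. The helper name `build_alternation` and the word list are assumptions made for illustration, and the words are treated as literal strings (escaped with `quotemeta`), not as regex fragments.

```perl
use strict;
use warnings;

# Hypothetical helper: build a capturing alternation from literal words,
# longest word first, so the longest candidate wins when several
# alternatives can match at the same starting position.
sub build_alternation {
    my @words = @_;
    my $alt = join '|',
              map  { quotemeta($_) }
              sort { length($b) <=> length($a) } @words;
    return qr/($alt)/;
}

my $string = "an apple in the fridge";
my @words  = ("apple", "apple in the fridge");

# Unsorted: "apple" is listed first and matches at the same offset.
my ($unsorted) = $string =~ /(apple|apple in the fridge)/;
my ($sorted)   = $string =~ build_alternation(@words);

print "unsorted: $unsorted\n";
print "sorted:   $sorted\n";
```

Here both alternatives start at the same offset, so the order decides the result: the unsorted pattern stops at "apple", while the sorted one captures "apple in the fridge".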

    However, any alternative that starts earlier in the target string will be matched, regardless of a longer string later. Given multiple alternatives, you'll have to find all of the matches, and then pick the longest one.

    There are a couple of ways to do this. One is to get all matches for each alternative, then sort the results by length. Something like this (untested):

    my $string  = "an apple in the fridge";
    my @regexes = ("an apple", "apple in the fridge", qr/(?:pine)?apples/);
    my %matches;
    for my $regex (@regexes) {
        # Capture the whole match; without the parentheses, a successful
        # list-context match returns 1 rather than the matched text.
        my @matches = $string =~ /($regex)/g;
        for my $match (@matches) {
            $matches{$match} = 1;    # or increment to keep score
        }
    }
    my @matches_sorted_by_length = sort { length($a) <=> length($b) } keys %matches;
    print "Longest match is $matches_sorted_by_length[-1]\n";

    However, this ignores the edge case where the same substring overlaps itself, like "hearth" in "hearthearth". If the regexes are not fixed length and/or have optional parts, some matches might be missed (including the longest one).

    The improvement to the above is to walk through the target string using pos to start matching just after the last match started. There are several examples you can find, which I'm too short of time to look up at the moment.
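
One possible shape for the pos-based walk, as a sketch only: the helper names `overlapping_matches` and `longest_match` are hypothetical, and the alternatives are assumed to be non-empty literal strings rather than arbitrary regexes. After each match, `pos` is reset to one character past the point where that match started, so overlapping matches are not skipped.

```perl
use strict;
use warnings;

# Hypothetical helper: return every match of the literal alternatives,
# including overlapping ones, as [start_offset, text] pairs.
# Assumes the alternatives are non-empty literal strings.
sub overlapping_matches {
    my ($string, @alternatives) = @_;

    # Longest first, so the longest literal wins at each start offset.
    my $alt = join '|',
              map  { quotemeta($_) }
              sort { length($b) <=> length($a) } @alternatives;

    my @found;
    while ($string =~ /($alt)/g) {
        push @found, [ $-[1], $1 ];
        # Resume one character after the point where this match started.
        pos($string) = $-[1] + 1;
    }
    return @found;
}

# Hypothetical helper: pick the longest text among all matches found.
sub longest_match {
    my ($string, @alternatives) = @_;
    my $longest = '';
    for my $m (overlapping_matches($string, @alternatives)) {
        $longest = $m->[1] if length($m->[1]) > length($longest);
    }
    return $longest;
}

print longest_match("an apple in the fridge",
                    "an apple", "apple in the fridge"), "\n";

# The self-overlapping case: "hearth" occurs at offsets 0 and 5.
print join(',', map { $_->[0] }
           overlapping_matches("hearthearth", "hearth")), "\n";
```

For patterns with optional or variable-length parts this sketch is not sufficient, since the engine still reports only one match per starting offset.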

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of