g man has asked for the wisdom of the Perl Monks concerning the following question:

The following is an excerpt of a text file:
right lymph node
lymph fluid
How would I match the longest string first, then the shorter string? In the example above, I would want my program to print out "lymph node" as the tissue type for line 1 but "lymph" for line 2. I have a table of relevant terms to match against, but there are separate entries for "lymph" and "lymph node".

Originally posted as a Categorized Question.


Replies are listed 'Best First'.
Re: How do I match for several strings, matching the longest string first?
by btrott (Parson) on Apr 16, 2000 at 22:14 UTC
    I think a good solution would be to sort the terms that you're matching for and create a regexp of the strings in sorted order. Sort them so that the regexp tries to match the longest string first, then moves on down in length until it's trying to match the shortest one.

    Something like the following should work:

    my @terms = ('lymph', 'lymph node');
    my @text  = ('right lymph node', 'lymph fluid');

    # create a regexp that will match the longest
    # string first and capture the string that matched
    my $words = '\b(' . join('|', sort { length $b <=> length $a } @terms) . ')\b';

    for my $text (@text) {
        if ($text =~ /$words/) {
            print $text, ": matched => ", $1, "\n";
        }
    }
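    To see why the sort matters: Perl tries the alternatives of a regexp from left to right and takes the first one that matches at a given position, so with 'lymph' listed first the longer term never gets a chance. The following is only a sketch of that comparison, not part of the reply above. The helper name build_pattern is hypothetical, and the quotemeta call is an added precaution for terms that might contain regexp metacharacters.

```perl
use strict;
use warnings;

# Hypothetical helper: build one capturing alternation from literal terms.
# quotemeta is an assumption added here so that any regexp metacharacters
# in a term are matched literally.
sub build_pattern {
    my @terms = @_;
    my $alternation = join '|', map { quotemeta($_) } @terms;
    return qr/\b($alternation)\b/;
}

my @terms = ('lymph', 'lymph node');

# Same terms, once in the given order and once sorted longest first.
my $unsorted = build_pattern(@terms);
my $sorted   = build_pattern(sort { length $b <=> length $a } @terms);

for my $text ('right lymph node', 'lymph fluid') {
    my ($first)  = $text =~ $unsorted;   # list context returns the capture
    my ($second) = $text =~ $sorted;
    print "$text: unsorted => $first, sorted => $second\n";
}
```

    For 'right lymph node' the unsorted pattern captures 'lymph', while the sorted one captures 'lymph node'. Both capture 'lymph' for 'lymph fluid'.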
Re: How do I match for several strings, matching the longest string first?
by chromatic (Archbishop) on Apr 17, 2000 at 01:42 UTC
    Another option is to arrange your search terms into a sorted list:
    my @terms = sort { length $b <=> length $a } ('lymph', 'lymph node');
    my @text  = ('right lymph node', 'lymph fluid');
    my %results;

    foreach my $term (@terms) {
        $results{$term} = (grep /\b$term\b/, @text);  # count matching lines
        @text = grep !/\b$term\b/, @text;             # remove matches
    }

    foreach (keys %results) {
        print "$_:\t", $results{$_}, "\n";
    }
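    This is likely less expensive than building one large regexp when there are many search terms, though the second grep, which rebuilds @text without the matched lines on every pass, adds a cost of its own and may not be a net win.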
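    If the goal is to report the tissue type for each line rather than a count per term, the same sorted list can be walked once per line. This is a sketch under that assumption. The helper name longest_term is hypothetical, and \Q...\E is added so that terms are matched literally.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first term in @$terms found in $line,
# or undef if none matches. Assumes @$terms is already sorted longest
# first, so the first hit is also the longest matching term.
sub longest_term {
    my ($line, $terms) = @_;
    for my $term (@$terms) {
        return $term if $line =~ /\b\Q$term\E\b/;
    }
    return;
}

my @terms = sort { length $b <=> length $a } ('lymph', 'lymph node');
my @text  = ('right lymph node', 'lymph fluid');

for my $line (@text) {
    my $term = longest_term($line, \@terms);
    print "$line: ", (defined $term ? $term : 'no match'), "\n";
}
```

    Note that this picks the longest term found anywhere in the line, whereas the single-regexp approach picks the longest term at the leftmost position where any term matches. For the sample data the two agree.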