Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I am stuck at a matching problem. I have an array which contains a list of terms and their translations, and I want to check if a word is contained in a string of text. I loop my array, and if a match is found I print out the translation. This is an example of my array:

my @dic= ( ['animal source food','aliment d’origine animale'], ['balanced diet','régime alimentaire équilibré'], ['food','aliment'], ['nutrition','nutrition'], ['nutrition assessment','évaluation de l’état nutritionnel'] );

I perform my matching with the following basic script (being $string my string of text:

foreach my $row (@glossary){ my $TermDBSource=@$row[0]; my $TermDBTarget=@$row[1]; if ($string=~ /$TermDBSource/){ print "$TermDBTarget\n"; last; } }

So far so good. My requirement is to match the longest n-gram. As I use a loop, the first match found in my array will be displayed. This could be a problem in a case where $string='this is a nutrition assessment' as it will match first 'nutrition' and not 'nutrition assessment' as desired. If I take out last it will match both 'nutrition' and 'nutrition assessment', again no good. The only idea I came up with is to order my array alphabetically (easy to do) and by length of n-gram (need to look at how to solve it, but I found https://perlmonks.com/?node_id=1219394 which may help me).

My question is more on the approach. Do you think this is the right approach, or I am complicated things too much and there are other more straightforward methods to achieve my goal?

Replies are listed 'Best First'.
Re: Matching wirth ordered array
by choroba (Cardinal) on Mar 18, 2019 at 22:52 UTC
    Sorting by the length starting from the longest is a common way. Instead of looping over the dictionary, you can build a regex from all the terms, and return the translation based on the matching part. Also, storing the dictionary in a hash makes retrieving the translation easier and faster. Something like the following:
    #!/usr/bin/perl use warnings; use strict; use feature qw{ say }; use utf8; use open 'OUT', ':encoding(UTF-8)', ':std'; my %dictionary = ( 'animal source food' => 'aliment d’origine animale', 'balanced diet' => 'régime alimentaire équilibré', 'food' => 'aliment', 'nutrition' => 'nutrition', 'nutrition assessment' => 'évaluation de l’état nutritionnel' ); my $text = 'nutrition and animal source food or nutrition assessment'; my $regex = join '|', map quotemeta, sort { length $b <=> length $a } keys %dictionary; while ($text =~ /($regex)/g) { say $dictionary{$1}; }

    Output:

    nutrition aliment d’origine animale évaluation de l’état nutritionnel

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: Matching wirth ordered array
by BillKSmith (Monsignor) on Mar 19, 2019 at 03:43 UTC
    Another approach is to remember the length of the longest match "so far" and the corresponding target.
    use strict; use warnings; my $string = 'this is a nutrition assessment'; my @dic= ( ['animal source food','aliment d’origine animale'], ['balanced diet','régime alimentaire équilibré'], ['food','aliment'], ['nutrition','nutrition'], ['nutrition assessment','évaluation de l’état nutritionnel'] ); my $longest = 0; my $best_target; foreach my $row (@dic){ my ($TermDBSource, $TermDBTarget) = @$row; my $len; if ($string =~ m/($TermDBSource)/ and ($len = length($1)) > $longe +st) { $longest = $len; $best_target = $TermDBTarget; } } print $best_target, "\n"; OUTPUT: &#920;valuation de lÆ&#920;tat nutritionnel

    UPDATE: Modified code to be more consistent with original post.

    Bill
Re: Matching wirth ordered array
by AnomalousMonk (Archbishop) on Mar 19, 2019 at 01:03 UTC
Re: Matching wirth ordered array
by AnomalousMonk (Archbishop) on Mar 19, 2019 at 09:35 UTC

    Here's an elaboration on choroba's approach. The idea is to make matching insensitive to case and to variations in embedded whitespace. Note that introducing pattern matching into the substitution keys comes with potential pitfalls, so a robust test suite should probably accompany this solution.

    Output:
    c:\@Work\Perl\monks\Anonymous Monk\1231409>perl dynamic_match_2.pl before: [[this is a nutrition assessment for animal source food needed for a Balanced Diet of FOOD for proper nutrition. ]] after: [[this is a <<evaluation de l'etat nutritionnel>> for <<aliment + d'origine animale>> needed for a <<regime alimentaire equilibre>> of <<aliment>> for proper <<diet>>. ]]
    (Tested under Perl version 5.8.9.)


    Give a man a fish:  <%-{-{-{-<