rajaman has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have a long list of terms that I want to use to tag terms appearing in different paragraphs of a large amount of text, in the following way:

#terms
ID|related terms delimited by '|'
1|Tesla S|Tesla model V|Tesla model|Tesla
2|Ford RAM truck|2020 Ford Mustand|Ford ranger|Ranger|ford|2020 Ford F-250|2020 Ford F250|Ford F 250
3|GM Chevrolet|GM Chevy|GM Chevrolet 2020|Chevrolet|Chevy|Chevrolet 2020|GM Chevrolet volt|Chevrolet Captiva Sport|General motors|GM
.
.

Program input:
... 2020 Ford F 250 is great but I prefer Tesla Model-V ...

Desired output:
... <2><2020 Ford F 250> is great but I prefer <1><Tesla model-V> ...
#<2> and <1> are the IDs for the terms

I was trying to do something like shown below, but it does not work out.

$term1 = 'Tesla S|Tesla model V|Tesla model|TESLA';
$term2 = 'Ford RAM truck|2020 Ford Mustand|Ford ranger|Ranger|ford|2020 Ford F-250|2020 Ford F250|Ford F 250';
term1 =~ s/\-/ /g;
term2 =~ s/\-/ /g;
$input = '... 2020 Ford F 250 is great but I prefer Tesla Model-V ...';
$input =~ s/\-/ /g;
$input =~ s/($term1)/\<1\>\<$1\>/ig;
$input =~ s/($term2)/\<2\>\<$1\>/ig;
.
.

Please suggest how to do this in an efficient and fast way.

Thank you!

Replies are listed 'Best First'.
Re: Matching terms in text
by Corion (Patriarch) on Apr 23, 2020 at 21:05 UTC

    Can you tell us how the approach you currently have does not work out?

    Maybe show us a small, self-contained working program that shows how things fail for you?

    From the look of it, your program should basically work, but maybe there is something I'm missing. Showing a runnable program and telling us how it fails to do what you want would help.

Re: Matching terms in text
by haukex (Archbishop) on Apr 23, 2020 at 21:25 UTC

    Please see Building Regex Alternations Dynamically.

    use warnings;
    use strict;
    use Text::CSV;

    my %terms;
    my $csv = Text::CSV->new({ binary=>1, auto_diag=>2, sep_char=>"|" });
    my $fh = *DATA;  # normally you'd open() a file here
    while ( my $row = $csv->getline($fh) ) {
        my $id = shift @$row;
        @terms{@$row} = ($id) x @$row;
    }
    $csv->eof or $csv->error_diag;

    my ($regex) = map {qr/$_/} join '|', map {quotemeta}
        sort { length $b <=> length $a or $a cmp $b } keys %terms;

    my $input = q{2020 Ford F 250 is great but I prefer Tesla Model-V};
    (my $output = $input) =~ s/($regex)/<$terms{$1}><$1>/g;

    use Test::More tests=>1;
    is $output, q{<2><2020 Ford F 250> is great but I prefer <1><Tesla Model-V>};

    __DATA__
    1|Tesla S|Tesla model V|Tesla model|Tesla|Tesla Model-V
    2|Ford RAM truck|2020 Ford Mustand|Ford ranger|Ranger|ford|2020 Ford F-250|2020 Ford F 250|Ford F 250
    3|GM Chevrolet|GM Chevy|GM Chevrolet 2020|Chevrolet|Chevy|Chevrolet 2020|GM Chevrolet volt|Chevrolet Captiva Sport|General motors|GM

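    The original question used the /i modifier, whereas the hash lookup above is keyed on the exact spelling of each term. A minimal sketch of a case-insensitive variant follows, assuming it is acceptable to lowercase the hash keys. The names %terms_by_id, %id_for and tag_terms are hypothetical, the term table is a shortened copy of the one in the question, and the \b anchors are an addition that is not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical term table: ID => related terms (shortened from the question).
my %terms_by_id = (
    1 => [ 'Tesla S', 'Tesla model V', 'Tesla model', 'Tesla' ],
    2 => [ 'Ford ranger', 'Ranger', 'ford', '2020 Ford F 250', 'Ford F 250' ],
);

# Map each lowercased term to its ID so that lookups ignore case.
my %id_for;
for my $id ( keys %terms_by_id ) {
    $id_for{ lc $_ } = $id for @{ $terms_by_id{$id} };
}

# Longest terms first, so "Tesla model V" is tried before "Tesla".
my $alternation = join '|',
    map  { quotemeta }
    sort { length $b <=> length $a or $a cmp $b }
    keys %id_for;

# Assumption: \b keeps short terms such as "ford" from matching inside longer words.
my $regex = qr/\b($alternation)\b/i;

# Tag every known term in $text as <ID><term>, keeping the original spelling.
sub tag_terms {
    my ($text) = @_;
    $text =~ s/$regex/'<' . $id_for{ lc $1 } . "><$1>"/ge;
    return $text;
}

print tag_terms('2020 Ford F 250 is great but I prefer TESLA MODEL V'), "\n";
```

    The matched text is kept as written in the input; only the lookup key is lowercased.
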
Re: Matching terms in text
by AnomalousMonk (Archbishop) on Apr 24, 2020 at 00:33 UTC

    "It doesn't work" is seldom a useful problem description. After fixing  term1 and  term2 in the OPed code to be valid scalar identifiers, here's what I get:

    c:\@Work\Perl\monks>perl -w -le
    "$term1 = 'Tesla S|Tesla model V|Tesla model|TESLA';
     $term2 = 'Ford RAM truck|2020 Ford Mustand|Ford ranger|Ranger|ford|2020 Ford F-250|2020 Ford F250|Ford F 250';
     $term1 =~ s/\-/ /g;
     $term2 =~ s/\-/ /g;
     $input = '... 2020 Ford F 250 is great but I prefer Tesla Model-V ...';
     ;;
     $input =~ s/\-/ /g;
     ;;
     $input =~ s/($term1)/\<1\>\<$1\>/ig;
     $input =~ s/($term2)/\<2\>\<$1\>/ig;
     ;;
     print $input;
     "
    ... <2><2020 Ford F 250> is great but I prefer <1><Tesla Model V> ...

    This is almost your desired output except that 'Model-V' becomes 'Model V' instead of 'model-V', which I hope you can live with. :)
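
    If the hyphen has to survive in the output, one option is to leave the input untouched and let the pattern accept either a space or a hyphen between words. This is only a sketch under that assumption; term_to_pattern is a hypothetical helper, and the term list is the Tesla group from the question written with spaces only.

```perl
use strict;
use warnings;

# Terms for ID 1, written with spaces only (an assumption for this sketch).
my @terms = ( 'Tesla model V', 'Tesla model', 'Tesla S', 'Tesla' );

# Hypothetical helper: escape each word, then allow a space or a hyphen between words.
sub term_to_pattern {
    my ($term) = @_;
    return join '[ -]', map { quotemeta } split / /, $term;
}

# Longest terms first, so the most specific alternative is tried first.
my $alternation = join '|',
    map  { term_to_pattern($_) }
    sort { length $b <=> length $a } @terms;
my $regex = qr/($alternation)/i;

my $input = '... I prefer Tesla Model-V ...';
( my $output = $input ) =~ s/$regex/<1><$1>/g;
print "$output\n";
```

    Because the input string is never rewritten, the tagged text keeps whatever separator the author typed.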

    The root problem might have been more clear had you provided an SSCCE (see Short, Self-Contained, Correct Example) to show the code and the resulting error messages.
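
    For illustration, a self-contained example for this question could be as small as the sketch below. The term list is a shortened, hypothetical subset of the Ford group. With strict and warnings enabled, a statement such as term1 =~ s/\-/ /g (missing its $ sigil) is rejected at compile time rather than passing unnoticed.

```perl
use strict;
use warnings;
use Test::More tests => 1;

# Shortened, hypothetical subset of the Ford terms, longest alternative first.
my $term2 = '2020 Ford F 250|Ford F 250|ford';
my $input = '2020 Ford F 250 is great';

# Tag every match with its ID, as in the original approach.
( my $got = $input ) =~ s/($term2)/<2><$1>/ig;

# State the expected result explicitly so a failure is visible.
is $got, '<2><2020 Ford F 250> is great', 'longest Ford term is tagged';
```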


    Give a man a fish:  <%-{-{-{-<