Matching terms in text

rajaman has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have a long list of terms that I want to use to tag matching terms in the paragraphs of a large amount of text, in the following way:

#terms
ID|related terms delimited by '|'
1|Tesla S|Tesla model V|Tesla model|Tesla
2|Ford RAM truck|2020 Ford Mustand|Ford ranger|Ranger|ford|2020 Ford F-250|2020 Ford F250|Ford F 250
3|GM Chevrolet|GM Chevy|GM Chevrolet 2020|Chevrolet|Chevy|Chevrolet 2020|GM Chevrolet volt|Chevrolet Captiva Sport|General motors|GM
.
.

Program input: ... 2020 Ford F 250 is great but I prefer Tesla Model-V ...

Desired output: ... <2><2020 Ford F 250> is great but I prefer <1><Tesla model-V> ... #<2> and <1> are the IDs for the terms

I tried something like the code shown below, but it does not work.

$term1 = 'Tesla S|Tesla model V|Tesla model|TESLA';
$term2 = 'Ford RAM truck|2020 Ford Mustand|Ford ranger|Ranger|ford|2020 Ford F-250|2020 Ford F250|Ford F 250';
term1 =~ s/\-/ /g;
term2 =~ s/\-/ /g;
$input = '... 2020 Ford F 250 is great but I prefer Tesla Model-V ...';
$input =~ s/\-/ /g;

$input =~ s/($term1)/\<1\>\<$1\>/ig;
$input =~ s/($term2)/\<2\>\<$1\>/ig;
.
.

Please suggest how to do this in an efficient and fast way.

Thank you!


Re: Matching terms in text by Corion (Patriarch) on Apr 23, 2020 at 21:05 UTC
Can you tell us how the approach you currently have does not work out? Maybe show us a small, self-contained working program that shows how things fail for you? From the look of it, your program should basically work, but maybe there is something I'm missing. Showing a runnable program and telling us how it fails to do what you want would help.
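One way the per-ID alternation from the question could misbehave, depending on the order of the terms: Perl takes the first alternative that matches at a given position, not the longest one. The following is only a small sketch of that effect, using the Ford terms in the order given in the question. The helper names `build_regex` and `tag` and the sample sentence are made up for illustration and do not appear in the original post.

```perl
use strict;
use warnings;

# Ford terms in the order given in the question; note that 'ford'
# comes before 'Ford F 250'.
my @terms = ('Ford RAM truck', '2020 Ford Mustand', 'Ford ranger', 'Ranger',
             'ford', '2020 Ford F-250', '2020 Ford F250', 'Ford F 250');

# Hypothetical helper: one case-insensitive alternation from a term list.
sub build_regex {
    my $alt = join '|', map { quotemeta } @_;
    return qr/($alt)/i;
}

# Hypothetical helper: tag every match with ID 2.
sub tag {
    my ($regex, $text) = @_;
    $text =~ s/$regex/<2><$1>/g;
    return $text;
}

my $as_given      = build_regex(@terms);
my $longest_first = build_regex(sort { length $b <=> length $a or $a cmp $b } @terms);

my $text = 'My Ford F 250 is great';      # made-up sample sentence
print tag($as_given,      $text), "\n";   # My <2><Ford> F 250 is great
print tag($longest_first, $text), "\n";   # My <2><Ford F 250> is great
```

With the terms as given, the short alternative `ford` wins at the position of "Ford", so only that word is tagged. Sorting the alternatives by descending length avoids this.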
Re: Matching terms in text by haukex (Archbishop) on Apr 23, 2020 at 21:25 UTC
Please see Building Regex Alternations Dynamically.

use warnings;
use strict;
use Text::CSV;

my %terms;
my $csv = Text::CSV->new({ binary=>1, auto_diag=>2, sep_char=>"|" });
my $fh = *DATA;  # normally you'd open() a file here
while ( my $row = $csv->getline($fh) ) {
    my $id = shift @$row;
    @terms{@$row} = ($id) x @$row;
}
$csv->eof or $csv->error_diag;

my ($regex) = map {qr/$_/} join '|', map {quotemeta}
    sort { length $b <=> length $a or $a cmp $b } keys %terms;

my $input = q{2020 Ford F 250 is great but I prefer Tesla Model-V};
(my $output = $input) =~ s/($regex)/<$terms{$1}><$1>/g;

use Test::More tests=>1;
is $output, q{<2><2020 Ford F 250> is great but I prefer <1><Tesla Model-V>};

__DATA__
1|Tesla S|Tesla model V|Tesla model|Tesla|Tesla Model-V
2|Ford RAM truck|2020 Ford Mustand|Ford ranger|Ranger|ford|2020 Ford F-250|2020 Ford F 250|Ford F 250
3|GM Chevrolet|GM Chevy|GM Chevrolet 2020|Chevrolet|Chevy|Chevrolet 2020|GM Chevrolet volt|Chevrolet Captiva Sport|General motors|GM
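The data above lists 'Tesla Model-V' explicitly, and the match is case-sensitive. If variants that differ only in case or in hyphen versus space should share one entry, a possible variation is sketched below. It is an illustration under stated assumptions, not part of the code above: the helpers `norm` and `tag_terms` are hypothetical, the term lists are shortened copies from the question, and the `\b` word-boundary anchors are an extra assumption meant to keep 'ford' from matching inside words such as 'affordable'.

```perl
use strict;
use warnings;

# Hypothetical helper: fold case and treat '-' like a space, so that
# 'Tesla Model-V' and 'Tesla model V' share one lookup key.
sub norm { my $s = lc shift; $s =~ tr/-/ /; return $s }

# Shortened term lists from the question, keyed by ID.
my %lists = (
    1 => 'Tesla S|Tesla model V|Tesla model|Tesla',
    2 => 'Ford ranger|Ranger|ford|2020 Ford F-250|2020 Ford F250|Ford F 250',
);

# Map each normalized term to its ID (the first ID wins on duplicates).
my %id_for;
for my $id (sort keys %lists) {
    $id_for{ norm($_) } //= $id for split /\|/, $lists{$id};
}

# Longest terms first; each space in a term may match ' ' or '-'.
my @alts;
for my $term (sort { length $b <=> length $a or $a cmp $b } keys %id_for) {
    push @alts, join '[ -]', map { quotemeta } split / /, $term;
}
my $alt   = join '|', @alts;
my $regex = qr/\b($alt)\b/i;

# Tag matches in a single pass, keeping the matched text as written.
sub tag_terms {
    my ($text) = @_;
    $text =~ s/$regex/sprintf '<%s><%s>', $id_for{ norm($1) }, $1/ge;
    return $text;
}

print tag_terms('... 2020 Ford F 250 is great but I prefer Tesla Model-V ...'), "\n";
# ... <2><2020 Ford F 250> is great but I prefer <1><Tesla Model-V> ...
```

Because the input is never rewritten, the original spelling (including the hyphen) survives in the tagged output, and the ID is found by normalizing only the matched text.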
Re: Matching terms in text by AnomalousMonk (Archbishop) on Apr 24, 2020 at 00:33 UTC
"It doesn't work" is seldom a useful problem description. After fixing `term1` and `term2` in the OPed code to be valid scalar identifiers, here's what I get:

c:\@Work\Perl\monks>perl -w -le
"$term1 = 'Tesla S|Tesla model V|Tesla model|TESLA';
 $term2 = 'Ford RAM truck|2020 Ford Mustand|Ford ranger|Ranger|ford|2020 Ford F-250|2020 Ford F250|Ford F 250';
 $term1 =~ s/\-/ /g;
 $term2 =~ s/\-/ /g;
 $input = '... 2020 Ford F 250 is great but I prefer Tesla Model-V ... ';
 ;;
 $input =~ s/\-/ /g;
 ;;
 $input =~ s/($term1)/\<1\>\<$1\>/ig;
 $input =~ s/($term2)/\<2\>\<$1\>/ig;
 ;;
 print $input;
 "
... <2><2020 Ford F 250> is great but I prefer <1><Tesla Model V> ...

This is almost your desired output except that `'Model-V'` becomes `'Model V'` instead of `'model-V'`, which I hope you can live with. :) The root problem might have been more clear had you provided an SSCCE (see Short, Self-Contained, Correct Example) to show the code and the resulting error messages.

Give a man a fish: `<%-{-{-{-<`