Hello ScarletRoxanne,
Here’s an approach which replaces each phrase/term with a temporary marker, then removes stopwords, then replaces the markers with their original terms:
use strict;
use warnings;
use Const::Fast;

const my $DELIM => "\034";    # ASCII FS control character (double quotes so the escape is interpreted)

my %stops = map { lc $_ => 1 } qw( I am the of and you are );
my @terms = ('manager of sales', 'chairman of the board');
@terms = sort { length $b <=> length $a } @terms;    # longest first

my $file3 = 'I am the Senior Manager of Sales and of Marketing. ' .
            'You are the Chairman of the Board of Directors.';

$file3 =~ tr/A-Z/a-z/;    # convert to lower case

# replace terms with temporary markers
# (\Q...\E keeps any regex metacharacters in a term literal)
$file3 =~ s{\Q$terms[$_]\E}{$DELIM$_$DELIM}gi for 0 .. $#terms;

my @file3 = split /\s+/, $file3;
@file3 = grep { ! exists $stops{$_} } @file3;    # remove stopwords

for my $entry (@file3) {
    if ($entry =~ /\Q$DELIM\E(\d+)\Q$DELIM\E/) {
        $entry = '*' . $terms[$1] . '*';    # restore the original term
    }
    else {
        $entry =~ s{[[:punct:]]}{}g;    # remove punctuation
    }
}

print "$_\n" for @file3;
Output:
17:35 >perl 1997_SoPW.pl
senior
*manager of sales*
marketing
*chairman of the board*
directors

17:35 >
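The longest-first sort matters most when one term is a prefix of another. Here is a minimal sketch of just the protect/restore steps, using two hypothetical helpers (`protect_terms` and `restore_terms` are my names for illustration, not from any module) and a made-up pair of overlapping terms:

```perl
use strict;
use warnings;

# Hypothetical helper: swap each term for a numbered marker,
# trying longer terms before shorter ones.
sub protect_terms {
    my ($text, @terms) = @_;
    my $delim = "\034";                                  # ASCII FS as marker delimiter
    @terms = sort { length $b <=> length $a } @terms;    # longest first
    for my $i (0 .. $#terms) {
        $text =~ s{\Q$terms[$i]\E}{$delim$i$delim}gi;
    }
    return ($text, \@terms);    # sorted list is needed to map markers back
}

# Hypothetical helper: put the original terms back, wrapped in asterisks.
sub restore_terms {
    my ($text, $terms) = @_;
    $text =~ s{\034(\d+)\034}{*$terms->[$1]*}g;
    return $text;
}

my @terms = ('chairman', 'chairman of the board');
my ($marked, $sorted) =
    protect_terms('the chairman of the board met the chairman', @terms);
print restore_terms($marked, $sorted), "\n";
# prints: the *chairman of the board* met the *chairman*
```

If the shorter term were substituted first, "chairman of the board" would already have been broken up by a marker and could never match as a whole.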
Hope that helps,
Athanasius <°(((>< contra mundum
Iustus alius egestas vitae, eros Piratica,