Here's an approach that seems to satisfy the OPer's requirements (which were somewhat vaguely expressed; I've included the qualifications inferred by others, plus a sentence ending with a period). It needs the \K regex enhancement of Perl 5.10:
>perl -wMstrict -le
"my @phrases = (
'kinase i', 'hib', 'tor', 'tor SET6',
'SET6', 'p16(INK4A)', 'cell',
);
my $delim = qr{ \. | \A \s* | \s+ | \s* \z }xms;
my $phrase = join '|', reverse sort map quotemeta, @phrases;
my $mark = qq{\x23};
for my $s (@ARGV) {
print '--------------';
print $s;
$s =~ s{ $delim \K ($phrase) (?= $delim) }
{$mark$1$mark}xmsg;
print $s;
}
"
"cell kinase inhibitor SET6 activates p16(INK4A) in cell-wall tor SET6."
"kinase tor tor SET6" "tor tor SET6 kinase" "tor tor SET6"
"kinase tor tor SET6." "tor tor SET6 kinase." "tor tor SET6."
"kinase inhibitor" "kinase inhibitor."
--------------
cell kinase inhibitor SET6 activates p16(INK4A) in cell-wall tor SET6.
#cell# kinase inhibitor #SET6# activates #p16(INK4A)# in cell-wall #tor SET6#.
--------------
kinase tor tor SET6
kinase #tor# #tor SET6#
--------------
tor tor SET6 kinase
#tor# #tor SET6# kinase
--------------
tor tor SET6
#tor# #tor SET6#
--------------
kinase tor tor SET6.
kinase #tor# #tor SET6#.
--------------
tor tor SET6 kinase.
#tor# #tor SET6# kinase.
--------------
tor tor SET6.
#tor# #tor SET6#.
--------------
kinase inhibitor
kinase inhibitor
--------------
kinase inhibitor.
kinase inhibitor.
(Note: "\x23" is the "#" character. I have to write it this way because of a peculiarity of my command line 'editor'.)
If Perl version 5.10 is not available, use
s{ ($delim) ($phrase) (?= $delim) }{$1$mark$2$mark}xmsg;
as the substitution regex (tested).
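For anyone who would rather avoid the command-line quoting, here is a minimal sketch of the same idea as a script file, using the capture-group form so it should also run on a pre-5.10 Perl. The sub name mark_phrases is made up for this sketch and is not part of the one-liner above; the phrase list, delimiter and substitution are copied from it.

```perl
use strict;
use warnings;

my @phrases = (
    'kinase i', 'hib', 'tor', 'tor SET6',
    'SET6', 'p16(INK4A)', 'cell',
);

# Same delimiter and alternation as in the one-liner.
my $delim  = qr{ \. | \A \s* | \s+ | \s* \z }xms;
my $phrase = join '|', reverse sort map quotemeta, @phrases;
my $mark   = '#';

# mark_phrases() is a hypothetical helper name used only in this sketch.
sub mark_phrases {
    my ($s) = @_;
    # Capture the leading delimiter and put it back, so \K is not needed.
    $s =~ s{ ($delim) ($phrase) (?= $delim) }{$1$mark$2$mark}xmsg;
    return $s;
}

print mark_phrases($_), "\n" for
    'kinase tor tor SET6',
    'tor tor SET6 kinase.',
    'kinase inhibitor';
```

The three test strings are taken from the run above, so the printed lines can be compared directly against that output.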
Of course, a lot more testing is recommended!
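The reverse in the
my $phrase = join '|', reverse sort map quotemeta, @phrases;
statement puts a longer phrase ahead of any shorter phrase that is a prefix of it (e.g., 'tor SET6' before 'tor'). Perl tries the alternatives of an alternation left to right, so the longest phrase gets the first chance to match.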
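A small sketch of the effect of the reverse, cut down to two phrases from the list. The helper first_match is a name made up for this illustration; it just reports what the alternation grabs at the start of a string. Note that the sort is an ordinary string sort, not a sort by length: it works here because a phrase always sorts before any longer phrase it is a prefix of.

```perl
use strict;
use warnings;

my @phrases = ('tor', 'tor SET6');

# Without reverse the short phrase comes first; with reverse, the long one.
my $plain    = join '|',         sort map quotemeta, @phrases;
my $reversed = join '|', reverse sort map quotemeta, @phrases;

# first_match() is a hypothetical helper for this sketch: it returns
# whatever the alternation matches at the start of the string.
sub first_match {
    my ($alternation, $s) = @_;
    return $s =~ m{ \A ($alternation) }xms ? $1 : undef;
}

print first_match($plain,    'tor SET6'), "\n";   # prints: tor
print first_match($reversed, 'tor SET6'), "\n";   # prints: tor SET6
```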
See also Regexp::Assemble and related modules for other (and perhaps better) ways to compile the $phrase regex.