in reply to phrase match

Thank you all, Monks, for your comments.

Here is my reworked solution. It seems to work on my sample sentences, but I am not sure it will work in every situation. If you see any glitch in this solution, please let me know.

Thanks once again.

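my @phrases = (
    'kinase i', 'hib', 'tor', 'tor SET6', 'SET6',
    'p16(INK4A)', 'cell', 'MAP', 'H1 inhibitor',
);
my $sentence = 'kinase inhibitor SET6 MAP H1 inhibitor activates p16(INK4A) in cell-wall';

# Collapse runs of whitespace inside each phrase to a single space.
for (my $i = 0; $i <= $#phrases; $i++) {
    $phrases[$i] =~ s/\s+/ /g;
}

# Build one alternation; quotemeta escapes characters such as '(' and ')'.
my $phrases_re = join '|', map { quotemeta } @phrases;

# Normalize the sentence and pad it so every word has whitespace on both sides.
$sentence =~ s/\s+/ /g;
$sentence = ' ' . $sentence . ' ';

# Tag each phrase. The (?=\s) lookahead leaves the trailing space unconsumed,
# so two phrases separated by a single space can both match.
$sentence =~ s/\s($phrases_re)(?=\s)/ \#$1\#/g;

# Clean up the padding again.
$sentence =~ s/\s+/ /g;
$sentence =~ s/^\s+|\s+$//g;
print "$sentence\n";

# Output: 'kinase inhibitor #SET6# #MAP# #H1 inhibitor# activates #p16(INK4A)# in cell-wall'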

Re^2: phrase match
by AnomalousMonk (Archbishop) on Dec 13, 2009 at 22:13 UTC

    I'm not sure I understand why the poor sentence must be mauled so relentlessly in your final approach, but it's fine with me if it works for you.

    I note that the approach you use does not seem to take account of longest versus shortest matches:  'tor SET6' can never match because  'tor' precedes it in the ordered alternation. Perhaps this is your intent, but be aware that as the code stands, longest-shortest matching behavior depends on the order in which phrases appear in the phrase list. (This is touched on in paragraph 5 of Re^3: phrase match.) See example below.

    I also note there is still no provision for a 'sentence' ending in a period, although again, perhaps this contingency will never arise. Example also below.

    >perl -wMstrict -le
    "my @phrases = (
       'kinase i', 'hib', 'tor', 'tor SET6', 'SET6', 'p16(INK4A)',
       'cell', 'MAP', 'H1 inhibitor', 'foo bar', 'foo', 'bar',
       );
     for (my $i=0;$i<=$#phrases;$i++) { $phrases[$i]=~s/\s+/ /g; }
     my $phrases_re = join '|', map { quotemeta } @phrases;
     for my $sentence (@ARGV) {
       print '------------------';
       print $sentence;
       $sentence =~ s/\s+/ /g;
       $sentence = ' '.$sentence.' ';
       $sentence =~ s/\s($phrases_re)(?=\s)/ \x23$1\x23/g;
       $sentence =~ s/\s+/ /g;
       $sentence =~ s/^\s+|\s+$//g;
       print $sentence;
       }
    "
    "kinase inhibitor SET6 MAP H1 inhibitor activates p16(INK4A) in cell-wall"
    "tor tor SET6 SET6" "SET6 tor SET6" "tor tor SET6 SET6."
    "foo bar bar" "foo foo bar bar"
    ------------------
    kinase inhibitor SET6 MAP H1 inhibitor activates p16(INK4A) in cell-wall
    kinase inhibitor #SET6# #MAP# #H1 inhibitor# activates #p16(INK4A)# in cell-wall
    ------------------
    tor tor SET6 SET6
    #tor# #tor# #SET6# #SET6#
    ------------------
    SET6 tor SET6
    #SET6# #tor# #SET6#
    ------------------
    tor tor SET6 SET6.
    #tor# #tor# #SET6# SET6.
    ------------------
    foo bar bar
    #foo bar# #bar#
    ------------------
    foo foo bar bar
    #foo# #foo bar# #bar#

      All very good points, AnomalousMonk!

      >I'm not sure I understand why the poor sentence must be mauled so relentlessly in your final approach, but it's fine with me if it works for you.

      I was just experimenting with a few more things.

      >I note that the approach you use does not seem to take account of longest versus shortest matches: 'tor SET6' can never match because 'tor' precedes it in the ordered alternation. Perhaps this is your intent, but be aware that as the code stands, longest-shortest matching behavior depends on the order in which phrases appear in the phrase list. (This is touched on in paragraph 5 of Re^3: phrase match.) See example below.

      Very good point. Yes, my 'phrase list' would be in the decreasing order of phrase string length.

      >I also note there is still no provision for a 'sentence' ending in a period, although again, perhaps this contingency will never arise. Example also below.

      Yes, I will have the period removed in a preprocessing step.

      Thanks a lot again.
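
The period-removal preprocessing step mentioned above is not shown in the thread. A minimal sketch of what it might look like follows, assuming only a sentence-final period (plus any trailing whitespace) should go and that periods inside the sentence must be kept. The name `strip_final_period` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: remove one sentence-final period and any
# whitespace after it; periods elsewhere in the sentence are untouched.
sub strip_final_period {
    my ($sentence) = @_;
    $sentence =~ s/\.\s*\z//;
    return $sentence;
}

print strip_final_period('tor tor SET6 SET6.'), "\n";
```

With the period gone, the final 'SET6' is followed by the padding space and so can be tagged like any other occurrence.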

        >I was just experimenting with a few more things.

        Experimentation is good!

        >... my 'phrase list' would be in the decreasing order of phrase string length.

        If you do a reverse sort on the phrase list or array, the job is done, and you no longer have to worry about insertion order when adding new phrases. See the example in Re^3: phrase match, para 5.
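
As a rough sketch of the sorting idea (not the code from Re^3: phrase match, which is not reproduced here), the phrases can be ordered longest-first just before the alternation is built. The helper name `tag_phrases` is hypothetical, and the lookarounds `(?<!\S)` and `(?!\S)` are an assumption swapped in for the space-padding approach, so the sentence needs no padding or trimming.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap every whole-phrase occurrence in #...#.
sub tag_phrases {
    my ($sentence, @phrases) = @_;

    # Longest phrases first, so 'tor SET6' is tried before 'tor'.
    # Ties are broken alphabetically to keep the order deterministic.
    my $phrases_re = join '|',
        map  { quotemeta }
        sort { length($b) <=> length($a) or $a cmp $b }
        @phrases;

    # (?<!\S) and (?!\S) demand whitespace or a string edge on each side
    # without consuming it, so adjacent phrases can both be tagged.
    $sentence =~ s/(?<!\S)($phrases_re)(?!\S)/#$1#/g;
    return $sentence;
}

# The phrase list is deliberately given shortest-first here.
print tag_phrases('tor tor SET6 SET6', 'tor', 'tor SET6', 'SET6'), "\n";
# prints: #tor# #tor SET6# #SET6#
```

Because the sort happens inside the helper, new phrases can be appended to the list in any order.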