in reply to Re: phrase match
in thread phrase match

This fixed version won't work: it fails with the error "Variable length lookbehind not implemented in regex".

Re^3: phrase match
by johngg (Canon) on Dec 13, 2009 at 14:11 UTC

    A way around that is to use an alternation of look-behinds ...

    qr/(?x) (?: (?<= \s ) | (?<= ^ ) ) ( $phrases_re ) (?= \s | $ )/

    ... although it is debatable whether this is clearer than your suggestions. In general I prefer look-arounds to replacing text with unaltered captures, but that's just me.

    Cheers,

    JohnGG
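    As a rough, runnable sketch of the alternation above: the sentence and phrase list are borrowed from another reply in this thread, and `$marked` is just an illustrative name. Each look-behind has a fixed width (one character for `\s`, zero for `^`), which is why the alternation compiles where a single `(?<= \s | ^ )` does not.

```perl
use strict;
use warnings;

# Sample data borrowed from another reply in this thread.
my $sentence = 'kinase inhibitor SET6 activates p16(INK4A) in cell-wall.';
my @phrases  = ('kinase i', 'inhibitor', 'tor SET6', 'SET6', 'p16(INK4A)', 'cell');

# quotemeta also escapes the spaces, so the phrases survive the /x modifier.
my $phrases_re = join '|', map { quotemeta } @phrases;

# Two look-behinds of fixed width (1 and 0), joined by an ordinary alternation.
my $rx = qr/(?x) (?: (?<= \s ) | (?<= ^ ) ) ( $phrases_re ) (?= \s | $ )/;

# Work on a copy so the original sentence stays untouched.
(my $marked = $sentence) =~ s/$rx/#$1#/g;
print "$marked\n";
```

    Because nothing outside the capture is consumed, the replacement only has to re-emit `$1`.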

Re^3: phrase match
by AnomalousMonk (Archbishop) on Dec 13, 2009 at 12:20 UTC
    An effective variable length look-behind is available in Perl 5.10 with the \K special escape. The following compiles
        my $rx = qr/(?:^| )\K($phrases_re)(?= |$)/;
    but whether it serves the OPer's true needs is another question.
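    For what it's worth, here is a minimal sketch of that regex used in a substitution. It assumes the sample sentence and phrases posted elsewhere in this thread; `$marked` is an illustrative name. Whatever is matched before `\K` is left out of the text that gets replaced, so the leading space never has to be captured and put back.

```perl
use strict;
use warnings;
use 5.010;    # \K (and say) need Perl 5.10 or later

# Sample data borrowed from another reply in this thread.
my $sentence = 'kinase inhibitor SET6 activates p16(INK4A) in cell-wall.';
my @phrases  = ('kinase i', 'inhibitor', 'tor SET6', 'SET6', 'p16(INK4A)', 'cell');
my $phrases_re = join '|', map { quotemeta } @phrases;

# The text before \K must match, but it is not part of what s/// replaces.
my $rx = qr/(?:^| )\K($phrases_re)(?= |$)/;

(my $marked = $sentence) =~ s/$rx/#$1#/g;
say $marked;
```

    The trailing space is only looked at, not consumed, so it is still available as the leading space of the next match.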

      That is useful sometimes, but it's not needed here: the leading space can simply be captured and put back, so a lookahead for the trailing space is enough.

      Run this:

      use warnings;
      $sentence = 'kinase inhibitor SET6 activates p16(INK4A) in cell-wall.';
      my @phrases = ('kinase i', 'inhibitor', 'tor SET6', 'SET6', 'p16(INK4A)', 'cell');
      my $phrases_re = join '|', map { quotemeta } @phrases;
      $sentence =~ s/(^| )($phrases_re)(?= |$)/$1#$2#/g;
      print $sentence, "\n";

      You get the output

      kinase #inhibitor# #SET6# activates #p16(INK4A)# in cell-wall.

      Update: There are ways to do this kind of thing without lookaheads or lookbehinds, just as a curiosity. Replace the substitution statement above with either

      $sentence =~ s/(^| )($phrases_re)( |$)/$1#$2#$3/g for 0, 1;
      or
      use 5.010;
      given ($sentence) {
          s/ /  /g;                                # double every space
          s/(^| )($phrases_re)( |$)/$1#$2#$3/g;
          s/  / /g;                                # collapse the doubled spaces again
      }
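      A sketch of the space-doubling idea, assuming that is what the second variant intends: if every space is doubled first, two neighbouring phrases no longer compete for the same space, and the doubled spaces are collapsed afterwards. This version uses a plain `for` as the topicalizer instead of `given` (my substitution, not the original code), and `$marked` is an illustrative name.

```perl
use strict;
use warnings;

# Sample data from the script earlier in this reply.
my $sentence = 'kinase inhibitor SET6 activates p16(INK4A) in cell-wall.';
my @phrases  = ('kinase i', 'inhibitor', 'tor SET6', 'SET6', 'p16(INK4A)', 'cell');
my $phrases_re = join '|', map { quotemeta } @phrases;

my $marked = $sentence;
for ($marked) {                                # aliases $_ to $marked
    s/ /  /g;                                  # double every space
    s/(^| )($phrases_re)( |$)/$1#$2#$3/g;      # each match consumes one space per side
    s/  / /g;                                  # collapse the doubled spaces again
}
print "$marked\n";
```

      One caveat with this sketch: a phrase that itself contains a space would no longer match the doubled text unless the spaces inside the phrase were doubled too.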

      Update: One more alternative is below.

      my %phrase;
      $phrase{$_}++ for @phrases;
      my @sentence = split /( +)/, $sentence;
      for (@sentence) {
          $phrase{$_} and $_ = "#" . $_ . "#";
      }
      $sentence = join "", @sentence;

      Update: Oh, let's not forget this one either.

      $sentence =~ s/(?<![^ ])($phrases_re)(?= |$)/#$1#/g;
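      The double negative `(?<![^ ])` reads as "not preceded by a character other than a space", which holds both right after a space and at the very start of the string, so one fixed-width look-behind covers both cases. A small sketch follows; the extra phrase 'kinase' is my own addition to the thread's list, there only to exercise the start-of-string case, and `$marked` is an illustrative name.

```perl
use strict;
use warnings;

my $sentence = 'kinase inhibitor SET6 activates p16(INK4A) in cell-wall.';

# 'kinase' is a hypothetical extra phrase, added to show a match at position 0.
my @phrases = ('kinase i', 'inhibitor', 'tor SET6', 'SET6', 'p16(INK4A)', 'cell', 'kinase');
my $phrases_re = join '|', map { quotemeta } @phrases;

# (?<![^ ]) fails only when the previous character exists and is not a space.
(my $marked = $sentence) =~ s/(?<![^ ])($phrases_re)(?= |$)/#$1#/g;
print "$marked\n";
```

      At position 0 the alternative 'kinase i' is tried first but fails the lookahead, so the engine backtracks to the later 'kinase' alternative.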

        Thanks for pointing out the error in my ‘fixed’ code!

        $sentence =~ s/(^| )($phrases_re)( |$)/$1#$2#$3/g for 0, 1;

        I wanted to point out a non-error in your correction above, since it took me a minute to understand its purpose. If you did the global replacement without the for modifier, you'd have the same problem that Crackers2 pointed out with my original: overlapping matches wouldn't be handled, because the space that should lead the second match has already been gobbled up as the trailing space of the first match. If I'm understanding correctly, the for 0, 1 simply makes a second pass to pick up any matches that were missed this way.
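        To check that reading, here is a quick sketch comparing one pass with two, using the sample sentence and phrases from the script above. The names `$one_pass` and `$two_pass` are mine.

```perl
use strict;
use warnings;

my $sentence = 'kinase inhibitor SET6 activates p16(INK4A) in cell-wall.';
my @phrases  = ('kinase i', 'inhibitor', 'tor SET6', 'SET6', 'p16(INK4A)', 'cell');
my $phrases_re = join '|', map { quotemeta } @phrases;

# Single pass: the space after "inhibitor" is consumed by the first match,
# so "SET6" has no leading space left to match against.
my $one_pass = $sentence;
$one_pass =~ s/(^| )($phrases_re)( |$)/$1#$2#$3/g;

# Two passes: the second run picks up the phrases skipped by the first.
my $two_pass = $sentence;
$two_pass =~ s/(^| )($phrases_re)( |$)/$1#$2#$3/g for 0, 1;

print "one pass:   $one_pass\n";
print "two passes: $two_pass\n";
```

        On the second pass the phrases that are already marked start with a "#" rather than following a space directly, so they are not marked twice.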