cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Good day monks. I am celebrating Labor Day weekend by laboring :-(

I am trying to process some text to consistently tokenize references to people. I am using the following regexp to do this for Condoleeza Rice:

$text =~ s/(national security adviser )?(dr. |doctor )?(condoleeza )?rice/condoleezarice/ig;
The problem is that this will match the substring "rice" in any old word. I guess I could fix this with anchors or a brute-force enumeration of options. But I'm wondering if there is a more elegant way to rewrite this regexp so the second and third terms function as a non-exclusive or. In other words (whether or not the title is in front of her name), it should match only these cases:
  1. dr. rice
  2. doctor rice
  3. dr. condoleeza rice
  4. doctor condoleeza rice
  5. condoleeza rice
I'm having trouble puzzling it out. The following expression
((dr. |doctor )|(condoleeza ))rice
doesn't match the first word of items 3 or 4 because the alternation operator | is functioning as an exclusive or. Your advice appreciated.

TIA....

Steve

Replies are listed 'Best First'.
Re: Non-exclusive or in regexp?
by Sidhekin (Priest) on Sep 06, 2004 at 00:48 UTC

    My first thought was:

    $text =~ s/(?:national security adviser |dr\. |doctor |condoleeza )+rice/condoleezarice/ig;

    But that is not what you want. You want exact. So ... we need to get down to the tricky stuff ... ;-)

    $text =~ s/(?!rice) (?:dr\.\ |doctor\ )? (?:condoleeza\ )? rice /condoleezarice/igx;

    I just love trickery! :-)

    print "Just another Perl ${\(trickster and hacker)},"
    The Sidhekin proves Sidhe did it!

      This is a very neat trick. It took a few moments to work out how it works ++

      cheers

      tachyon

      Well it works but I'm clearly way out-monked here because I don't have a clue how! Can someone enlighten me?
        Well it works but I'm clearly way out-monked here because I don't have a clue how! Can someone enlighten me?

        The trick lies in the zero-width assertion: It will fail if none of the following optionals match anything. (It asserts no rice, but if none of the optionals match, rice is what follows.)
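To see the assertion at work, here is a minimal sketch that wraps the negative-lookahead substitution from above in a helper and runs it over a few strings. The helper name `tokenize` and the sample strings are made up for illustration; they are not from the original posts.

```perl
use strict;
use warnings;

# Hypothetical helper around the negative-lookahead substitution.
sub tokenize {
    my ($text) = @_;
    # (?!rice) refuses any start position where "rice" follows immediately,
    # so at least one of the optional parts has to consume something.
    $text =~ s/(?!rice) (?:dr\.\ |doctor\ )? (?:condoleeza\ )? rice /condoleezarice/igx;
    return $text;
}

print tokenize($_), "\n" for (
    'dr. rice',
    'doctor condoleeza rice',
    'condoleeza rice',
    'white rice',
    'the price',
);
```

The first three strings should come back as the token, while the last two should come back unchanged, because at the only position where "rice" could match, the lookahead has already vetoed the attempt.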

        I guess my answer was rather on the minimalist side. So, here is a commented version. But I won't just comment the same version I made first -- what's the fun in that? This time I'll use a positive lookahead zero-width assertion, not a negative. And I'll borrow the variables from tachyon's answer, including the job description that I somehow forgot first time around.

        So, here goes:

        # Note, tachyon, we need the /i modifier on each of these as well:
        my $job_desc = qr/\b national\s+security\s+adviser \s+/ix;
        my $title    = qr/\b (?:dr\.|doctor) \s+/ix;
        my $f_name   = qr/\b condoleeza \s+/ix;
        my $surname  = qr/\b rice \b/ix;
        my $repl     = 'condoleezarice';

        $text =~ s/# Zero-width assertion: One of the options must follow:
                   (?= $job_desc | $title | $f_name )
                   $job_desc ?  # Option 1: job description
                   $title ?     # Option 2: title
                   $f_name ?    # Option 3: first name
                   $surname     # Surname -- not optional
                  /$repl/gx;

        (If the surname could match any of the optionals, this would not work. Neither would the negative lookahead, though it would fail differently. But that is a problem I'll face no sooner than I need to.)

        Update: Turns out there is an implied grouping with qr//, so I have removed the explicit grouping from my code.
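The implied grouping can be checked with a small sketch. It compares an interpolated qr// object against the same alternation pasted in by hand; the variable names `$grouped` and `$pasted`, the helper `check`, and the test strings are assumptions made for this illustration.

```perl
use strict;
use warnings;

my $title = qr/dr\.|doctor/i;

# Interpolated qr// object: the "?" applies to the whole alternation.
my $grouped = qr/\A $title? rice \z/x;

# Same text pasted in by hand: "|" now splits the entire pattern in two.
my $pasted = qr/\A dr\.|doctor? rice \z/ix;

# Hypothetical helper: 1 if the string matches, 0 otherwise.
sub check { my ($re, $s) = @_; return $s =~ $re ? 1 : 0 }

for my $s ('rice', 'doctorrice', 'dr.') {
    printf "%-12s grouped=%d pasted=%d\n", $s, check($grouped, $s), check($pasted, $s);
}
```

The string 'dr.' should match only the pasted pattern, which is the sign that the hand-written alternation has swallowed the anchor and the surname into its branches, while the qr// object behaves as one unit.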

        print "Just another Perl ${\(trickster and hacker)},"
        The Sidhekin proves Sidhe did it!

Re: Non-exclusive or in regexp?
by ikegami (Patriarch) on Sep 06, 2004 at 00:39 UTC
    $dr = qr/(?:dr\.|doctor)/;
    $gn = qr/(?:condoleeza)/;
    $sn = qr/(?:rice)/;

    / $dr \s+ $sn
    | $dr \s+ $gn \s+ $sn
    | $gn \s+ $sn
    /x

    which can be simplified to:

    $dr = qr/(?:dr\.|doctor)/;
    $gn = qr/(?:condoleeza)/;
    $sn = qr/(?:rice)/;

    /(?:$dr (?:\s+ $gn)? | $gn) \s+ $sn/x

    Don't forget to escape the . in dr. to dr\.!
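As a rough check, the simplified pattern can be run against the five cases from the question plus a bare "rice". This is only a sketch: the /i flags on the pieces, the `$name` variable, and the `matches_name` helper are additions for the illustration, not part of the reply above.

```perl
use strict;
use warnings;

my $dr = qr/(?:dr\.|doctor)/i;
my $gn = qr/condoleeza/i;
my $sn = qr/rice/i;

# Title with an optional given name, or the given name alone, then the surname.
my $name = qr/(?:$dr (?:\s+ $gn)? | $gn) \s+ $sn/x;

# Hypothetical helper: 1 if the string contains a match, 0 otherwise.
sub matches_name { my ($s) = @_; return $s =~ $name ? 1 : 0 }

for my $s ('dr. rice', 'doctor rice', 'dr. condoleeza rice',
           'doctor condoleeza rice', 'condoleeza rice', 'white rice') {
    printf "%-24s %s\n", $s, matches_name($s) ? 'match' : 'no match';
}
```

Because the alternation has no branch that is empty, a title or a given name is always required, so "white rice" should be the only line reported as not matching.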

Re: Non-exclusive or in regexp?
by tachyon (Chancellor) on Sep 06, 2004 at 00:55 UTC

    You can do something like this to get your desired results. Effectively, we only do the substitution if we have a match in $2 as well as $1, which means we have to have "SOMETHING rice", where SOMETHING is the maximal non-exclusive OR of the presented cases.

    my $jdsc  = qr/(?:national\s+security\s+adviser\s+)/;
    my $title = qr/(?:dr.|doctor)\s+/;
    my $fn    = qr/(?:condoleeza\s+)/;
    my $sn    = qr/rice/;
    my $repl  = 'condoleezarice';

    while(<DATA>){
        s{ ( ( $jdsc? $title? $fn? ) $sn ) }
         { $2 ? $repl : $1 }exig;
        print;
    }

    __DATA__
    white rice
    dr. rice
    doctor rice
    dr. condoleeza rice
    doctor condoleeza rice
    condoleeza rice
    national security adviser doctor condoleeza rice

    Update: I prefer Sidhekin's method:

    s/ (?!$sn) $jdsc? $title? $fn? $sn /$repl/igx;

    cheers

    tachyon

Re: Non-exclusive or in regexp?
by Eimi Metamorphoumai (Deacon) on Sep 06, 2004 at 00:46 UTC
    Another alternative, one that gives false positives, but not on realistic texts, and is pretty simple, would be
    /(?:dr\. |doctor |condoleeza )+rice\b/i
    It would also match unlikely things like "condoleeza dr. rice", but it'll make sure that at least one of them, and as many as possible, is there.
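A short sketch of that trade-off, using the repeated-alternation pattern with no space in front of "condoleeza". The names `$loose` and `matched_text` and the sample strings are assumptions made for the illustration.

```perl
use strict;
use warnings;

# One or more of the optional parts, in any order, then the surname.
my $loose = qr/(?:dr\. |doctor |condoleeza )+rice\b/i;

# Hypothetical helper: returns the matched text, or undef when nothing matches.
sub matched_text { my ($s) = @_; return $s =~ /($loose)/ ? $1 : undef }

for my $s ('dr. condoleeza rice', 'condoleeza dr. rice',
           'white rice', 'doctor ricecake') {
    my $m = matched_text($s);
    printf "%-22s %s\n", $s, defined $m ? "matched '$m'" : 'no match';
}
```

The second string is the false positive mentioned above: the parts appear in the wrong order but the whole phrase should still be matched. The last two strings should not match, since one has no title or first name and the other fails the trailing \b.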
Re: Non-exclusive or in regexp?
by Zaxo (Archbishop) on Sep 06, 2004 at 00:39 UTC
    . . . in any old word

    You can force that to only match as a whole word with the word boundary assertion \brice\b. Otherwise, your regex looks about right to me.

    After Compline,
    Zaxo
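For comparison, here is a sketch of the original substitution with only the word-boundary assertions added (and the dot escaped). The helper name `tokenize_wb` and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the question's regex plus \b around the surname.
sub tokenize_wb {
    my ($text) = @_;
    $text =~ s/(national security adviser )?(dr\. |doctor )?(condoleeza )?\brice\b/condoleezarice/ig;
    return $text;
}

print tokenize_wb($_), "\n" for (
    'dr. rice',
    'the price is right',
    'white rice',
);
```

Under these assumptions, "price" should be left alone, but a bare "rice" that stands as its own word would still be replaced because every prefix is optional. Whether that matters depends on whether the text ever mentions rice as a word on its own.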

Re: Non-exclusive or in regexp?
by sintadil (Pilgrim) on Sep 06, 2004 at 00:38 UTC
Re: Non-exclusive or in regexp?
by Anonymous Monk on Sep 06, 2004 at 12:08 UTC
    s{((national\s+security\s+adviser\s+)?(dr\.\s+|doctor\s+)?(condoleeza\s+)?)
      (?(?{length $1})|(?!))rice}{condoleezarice}igx