cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Good day monks. I'm trying to regularize references to U.S. government entities in some text. Taking the U.S. Army as an example:
$text = "U.S. Army\nthe Army has been berry-berry good to me\nUS-Army\nUS Army\n";
$text =~ s/\bArmy\b|U\.S\. Army/US-Army/g;
This yields the following
US-Army
the US-Army has been berry-berry good to me
US-US-Army
US US-Army
The last two entries are obviously not what I want and they are happening because the dash counts as a boundary. This is even worse in the context of a loop like
while ($text =~ s/\bArmy\b|U\.S\. Army/US-Army/g) {
    print "$text\n\n";
}
which results in an infinite loop. So my question is: How can I get the dash to not count as a boundary character? (Because of external constraints, it's not possible to use another joining character.)

Many thanks in advance...

Steve

Replies are listed 'Best First'.
Re: Regexp dashes at boundaries
by brian_d_foy (Abbot) on Mar 17, 2005 at 22:42 UTC

    You don't always need alternation, especially if most of the stuff on either side is the same. You can use the ? quantifier to denote some parts as optional.

    This example works for the input you showed, but other forms you might find might require it to change a bit.

    #!/usr/bin/perl
    $text = <<"HERE";
    U.S. Army
    the Army has been berry-berry good to me
    US-Army
    US Army
    HERE
    $text =~ s/(?:U.?S.?[\s-])*Army/US-Army/g;
    print $text;
    --
    brian d foy <bdfoy@cpan.org>
      Yes that does it. Now can I ask why you use the (?:...) extension? I tried it without and it works. I assume you have a good reason for putting it in there, but I looked up that regexp extension in the Camel book and the explanation is...um...sort of cryptic: "This is for clustering, not capturing; it groups subexpressions like ``()'', but doesn't make backreferences as ``()'' does."

      Thanks for the help...

      Steve

        I used the non-capturing parentheses to create a group so I could apply a quantifier to it. I want the "U.S. " to be optional, so I want to apply a * to that whole group. I should probably have used a ? (zero or one) though.

        --
        brian d foy <bdfoy@cpan.org>
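
A minimal sketch of the grouping point above, assuming the four sample lines from the original question. The sub name `normalize_army` is hypothetical. Unlike the posted pattern, the dots are escaped here so they match only literal periods, and the group takes ? rather than *, as the reply suggests it probably should.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite every Army reference as "US-Army".
sub normalize_army {
    my ($text) = @_;
    # (?: ... )? groups the optional "U.S. " / "US-" / "US " prefix without
    # capturing it; the group exists only so that ? applies to all of it.
    $text =~ s/(?:U\.?S\.?[\s-])?\bArmy\b/US-Army/g;
    return $text;
}

my $text  = "U.S. Army\nthe Army has been berry-berry good to me\nUS-Army\nUS Army\n";
my $once  = normalize_army($text);
my $twice = normalize_army($once);   # a second pass should change nothing

print $once;
print "stable on second pass\n" if $once eq $twice;
```

Because an existing "US-" prefix is consumed along with "Army" and replaced by the same text, repeated passes leave the string unchanged. Plain capturing parentheses would behave the same here; the non-capturing form just avoids setting $1 when nothing uses it.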
Re: Regexp dashes at boundaries
by Roy Johnson (Monsignor) on Mar 18, 2005 at 12:47 UTC
    To answer your question: \b is not a character but a zero-width assertion, roughly (?:(?<!\w)(?=\w)|(?<=\w)(?!\w)) (next char is a word char and the previous one is not, or vice versa; the start and end of the string count as non-word). If you want a different definition of a boundary, you need to roll your own boundary expression.

    Caution: Contents may have been coded under pressure.
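
One way to roll such a boundary, as a rough sketch: use negative lookarounds with a character class that treats the dash like a word character. The sub name `regularize` and the class `[\w-]` are illustrative choices, not something from the thread, and the first substitution for the "U.S. Army" / "US Army" forms is an assumption about the desired output.

```perl
use strict;
use warnings;

# Hypothetical helper: two passes, prefix forms first, then bare "Army".
sub regularize {
    my ($text) = @_;
    # Step 1: fold the space-separated prefixes into the target form.
    $text =~ s/\bU\.?S\.?\s+Army\b/US-Army/g;
    # Step 2: a bare "Army" counts only when it is not glued to a word
    # character or a dash on either side (string edges are fine).
    $text =~ s/(?<![\w-])Army(?![\w-])/US-Army/g;
    return $text;
}

my $text = "U.S. Army\nthe Army has been berry-berry good to me\nUS-Army\nUS Army\n";
print regularize($text);
```

Since "US-Army" no longer matches either substitution, running the sub again returns the same text, which is what a `while (s///g)` loop needs in order to stop.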
Re: Regexp dashes at boundaries
by crashtest (Curate) on Mar 17, 2005 at 22:10 UTC
    Your regular expression doesn't do what you think it does (I think!). The alternation meta-character | works on the two entities directly next to it. You need to use parentheses with your alternation, maybe like this:
    $text =~ s/(\bArmy\b)|(U\.S\. Army)/US-Army/g;
    Update: Arghh! Completely disregard. Alternation extends further than I said unless you limit it with parentheses. I seem to be living in opposite land today.

    ... but it is cleaner to "factor out" the "Army" string and do something similar to:
    $text =~ s/(U\.S\.)?\bArmy/US-Army/g;
    Note: completely untested!

    Hope this helps.

      The alternation meta-character | works on the two entities directly next to it.

      Not in Perl:

      $t = 'abcd';
      $t =~ s/bc|xy/pq/;
      print "$t\n";
      ==> apqd
      If what you said were true, the match above would have failed, and the contents of $t would have remained unchanged.

      the lowliest monk

        and for possible further illumination:
        $t  = 'abcd';
        $t1 = 'ababxycdcd';
        $t2 = 'ababxycdcd';
        $t3 = 'ababxycdcd'; # same as $t2, which is identical to $t1
        $t4 = 'ababxycdcd'; # still the same...

        $t =~ s/bc|xy/pq/;
        print "\$t is: $t\n";
        $t1 =~ s/(ab|xy)/pq/;
        print "\$t1 is: $t1\n";
        $t2 =~ s/(ab|xy)/pq/g;
        print "\$t2 is: $t2\n";
        $t3 =~ s/(ab)|(xy)/pq/g;
        print "\$t3 is: $t3\n";
        $t4 =~ s/((ab)|(xy))/pq$1/g; # capturing regex: outer () captures; grouping () inside
        print "\$t4 is: $t4\n";

        =head1 OUT

        Output of C:\_perl\pl_test>perl 440581.pl
        $t is: apqd                 matched the 'bc'
        $t1 is: pqabxycdcd          parenthesized the alt; now matches the FIRST 'ab'
        $t2 is: pqpqpqcdcd          added /g (match globally); replaces both 'ab's and the 'xy'
        $t3 is: pqpqpqcdcd          shifting parens ==> no change, here
        $t4 is: pqabpqabpqxycdcd    capturing regex -- 3 'pq' pairs, each followed by 'ab' or 'xy'

        =cut
        additional example added
      I suppose you are right about the alternation error; however, the suggested regexp yields:
      U.S. US-Army
      the US-Army has been berry-berry good to me
      US-US-Army
      US US-Army
      Which now does the first case wrong as well.
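
The "U.S. US-Army" result appears to come from the missing separator: the suggested pattern allows nothing between "U.S." and "Army", so the optional group is skipped and only the bare word matches. A small sketch of that diagnosis, using a hypothetical helper `apply` and only the first sample line:

```perl
use strict;
use warnings;

# Hypothetical helper: apply a compiled pattern to a copy of the text.
sub apply {
    my ($re, $text) = @_;
    $text =~ s/$re/US-Army/g;
    return $text;
}

my $line = "U.S. Army";

# As suggested: no room for the space after "U.S.", so the group is skipped.
print apply(qr/(U\.S\.)?\bArmy/, $line), "\n";        # U.S. US-Army

# Allowing whitespace inside the optional group lets the prefix be consumed.
print apply(qr/(?:U\.S\.\s+)?\bArmy/, $line), "\n";   # US-Army
```

This only repairs the first case. "US-Army" still becomes "US-US-Army" with the second pattern, because the dash remains a word boundary; that part needs the prefix to cover the dashed form or a custom boundary.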
Re: Regexp dashes at boundaries
by sh1tn (Priest) on Mar 18, 2005 at 18:35 UTC
    while(my $line=<DATA>){ $_ .= $line }
    s/(u)\W?(s)\W?\s*\W?\s*(army)/uc($1.$2).'-'.ucfirst$3/gieo;
    print

    # STDOUT:
    #
    # US-Army
    # the US-Army has been
    # US-Army
    # USA Army . US-Army

    __DATA__
    U.S. Army
    the US army has been
    US - Army
    USA Army . U-S- Army