Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I need to write simple regular expressions such as "any word in all capital letters that is longer than 2 characters", which I would write as

/\b\U/w{2,}\E\b\/

This needs to work in Unicode (UTF8) also, and after glancing at perlretut, perlunicode etc I still have a number of questions:

- does \b work with Unicode "words"?
- what's the difference between \p{Lu} and \p{IsUpper} (if there is one)?
- How would I rewrite the above regexp? Is this correct:

/\b\p{Lu}{2,}\b\/
I assume I don't need to use (\w\p{Lu}), since \w should be a subset of \p{Lu}?

Re: regular expressions in unicode
by Ieronim (Friar) on Jan 25, 2007 at 12:00 UTC
    Word boundaries are tricky and certainly should not be used with Unicode data, as /\b/ means a boundary between \W and \w, i.e. between [^A-Za-z0-9_] and [A-Za-z0-9_].

    You must decide what you call a 'word'. In my opinion it is any alphanumeric sequence with apostrophes and hyphens, but your definition may differ.

    #!/usr/bin/perl
    use warnings;
    use strict;

    my $test = "TEST TESt TE'st T 12TE";
    while ($test =~ /(?<![\pL\pN\'-])  # NOT a hyphen, apostrophe, letter or number before
                     (\p{Lu}{2,})      # two or more uppercase letters
                     (?![\pL\pN\'-])   # NOT a hyphen, apostrophe, letter or number after
                    /xg) {
        print "$1\n";
    }
    My regex matches only "TEST" at the beginning of the string in the example.
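
    To check that the same lookaround approach also copes with non-ASCII letters, here is a minimal sketch. The helper name uppercase_words and the sample string are made up for illustration (they are not from the original post), and the Cyrillic words are written as \x{...} escapes so the result does not depend on the encoding of the source file.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original post): returns every run of
# two or more uppercase letters that is not glued to another letter, number,
# apostrophe or hyphen.
sub uppercase_words {
    my ($text) = @_;
    my @found = $text =~ /(?<![\pL\pN\'-])  # not preceded by a letter, number, apostrophe or hyphen
                          (\p{Lu}{2,})      # two or more uppercase letters
                          (?![\pL\pN\'-])   # not followed by a letter, number, apostrophe or hyphen
                         /xg;
    return @found;
}

# Sample data: an ASCII word, an all-uppercase Cyrillic word (U+0414 U+0410),
# a capitalised Cyrillic word (U+0414 U+0430), and some near misses.
my $sample = "NASA \x{414}\x{410} \x{414}\x{430} T X-RAY 12TE";

binmode(STDOUT, ':encoding(UTF-8)');   # avoid "Wide character" warnings on output
print "$_\n" for uppercase_words($sample);
```

    With this sample only the first two words should be reported: the capitalised word fails the {2,} requirement, and "X-RAY" and "12TE" are rejected by the lookarounds.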

         s;;Just-me-not-h-Ni-m-P-Ni-lm-I-ar-O-Ni;;tr?IerONim-?HAcker ?d;print
Re: regular expressions in unicode
by dave_the_m (Monsignor) on Jan 25, 2007 at 11:38 UTC
    which I would write as /\b\U/w{2,}\E\b\/

    Um, just as an aside, that doesn't do what you think it does (even ignoring the typos). It's equivalent to

    /\b\W{2,}\b/
    Dave.
      Sorry for the typos.
      ... I see that, but why? I thought \U was supposed to be an escape sequence to convert the character sequence to uppercase. So I thought \U\w{2,} would only match uppercase characters. \U[A-Za-z0-9_]{2,} does just that; and I thought \w was a shortcut for that character class? (simply put, except for locale settings, or when Unicode is used etc)

      Thanks
        I thought \U was supposed to be an escape sequence to convert the character sequence to uppercase.
        It is, but it applies to the pattern, not to the string being matched. It's most useful when the pattern contains an interpolated string, e.g.
        $ perl -le '$s = "a"; print "\U$s"'
        A
        $ perl -le '$s = "a"; print "matched a" if "a" =~ "\U$s"'
        $ perl -le '$s = "a"; print "matched A" if "A" =~ "\U$s"'
        matched A
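
        The same effect explains why \U\w ends up behaving like \W: the case conversion is applied to the pattern text itself, before the regex is compiled. A small sketch of that follows; the variable names and test strings are chosen here for illustration only.

```perl
use strict;
use warnings;

# \U...\E uppercases the pattern text, so "\w" becomes "\W" before compilation.
# \A and \z sit outside the \U...\E span and are left untouched.
my $with_U  = qr/\A\U\w{2,}\E\z/;
my $plain_W = qr/\A\W{2,}\z/;

for my $s ("AB", "ab", "!!") {
    printf "%-3s \\U\\w{2,}: %-8s \\W{2,}: %s\n", $s,
        ($s =~ $with_U  ? "match" : "no match"),
        ($s =~ $plain_W ? "match" : "no match");
}
```

        Both patterns should agree on every line: neither "AB" nor "ab" matches, while "!!" does, because the class being tested is "non-word character" rather than "uppercase word character".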

        Dave.