regular expressions in unicode

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I need to write simple regular expressions such as "any word in all capital letters that is longer than 2 characters", which I would write as

/\b\U/w{2,}\E\b\/

This needs to work in Unicode (UTF8) also, and after glancing at perlretut, perlunicode etc I still have a number of questions:

- does \b work with Unicode "words"?
- what's the difference between \p{Lu} and \p{IsUpper} (if there is one)?
- How would I rewrite the above regexp? Is this correct:

/\b\p{Lu}{2,}\b\/
I assume I don't need to use (\w\p{Lu}), since \w should be a subset of \p{Lu}?


Re: regular expressions in unicode by Ieronim (Friar) on Jan 25, 2007 at 12:00 UTC
Word boundaries are tricky and certainly should not be used with Unicode data, as `/\b/` means a boundary between `\W` and `\w`, i.e. for ASCII between `[^A-Za-z0-9_]` and `[A-Za-z0-9_]`. You must decide what you call a 'word'. In my opinion it is any alphanumeric sequence with apostrophes and hyphens, but your definition may differ.

    #!/usr/bin/perl
    use warnings;
    use strict;

    my $test = "TEST TESt TE'st T 12TE";
    while ($test =~ /(?<![\pL\pN\'-])  # NOT a hyphen, apostrophe, letter or number before
                     (\p{Lu}{2,})      # two or more uppercase letters
                     (?![\pL\pN\'-])   # NOT a hyphen, apostrophe, letter or number after
                    /xg) {
        print "$1\n";
    }

My regex matches only "TEST" at the beginning of the string in the example.
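As a rough sketch of how the lookaround approach above behaves on non-ASCII text, the same pattern can be wrapped in a small helper and run on a mixed Latin/Cyrillic string. The helper name `uppercase_words` and the sample string are made up for illustration, and the sketch assumes the script itself is saved as UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8

# Hypothetical helper (not from the thread): collect every run of two or
# more uppercase letters that is not glued to a letter, number,
# apostrophe or hyphen on either side.
sub uppercase_words {
    my ($text) = @_;
    my @found;
    while ($text =~ /(?<![\pL\pN'-])   # no letter, number, apostrophe or hyphen before
                     (\p{Lu}{2,})      # two or more uppercase letters
                     (?![\pL\pN'-])    # no letter, number, apostrophe or hyphen after
                    /xg) {
        push @found, $1;
    }
    return @found;
}

binmode STDOUT, ':encoding(UTF-8)';
# Prints ÉCOLE and ПРИВЕТ, one per line
print "$_\n" for uppercase_words("ÉCOLE école ПРИВЕТ Привет AB-cd X 12TE");
```

Because `\p{Lu}`, `\pL` and `\pN` are Unicode properties, the pattern itself needs no change for non-ASCII input; what matters is that the string holds decoded characters rather than raw UTF-8 bytes.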
Re: regular expressions in unicode by dave_the_m (Monsignor) on Jan 25, 2007 at 11:38 UTC
"which I would write as `/\b\U/w{2,}\E\b\/`"

Um, just as an aside, that doesn't do what you think it does (even ignoring the typos). It's equivalent to `/\b\W{2,}\b/`.

Dave.
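A small sketch can make this visible. It assumes the typos are corrected to `\w`, anchors the patterns with `\A` and `\z` instead of `\b` to keep the test strings simple, and uses variable names (`$swapped`, `$upper`) that are purely illustrative. The second pattern is the `\p{Lu}` form proposed in the question, shown for contrast.

```perl
use strict;
use warnings;

# \U ... \E is applied to the pattern text itself, so the literal "\w"
# between them becomes "\W" before the regex is compiled.
my $swapped = qr/\A\U\w{2,}\E\z/;    # behaves like \A\W{2,}\z

# For contrast: the Unicode property for uppercase letters.
my $upper = qr/\A\p{Lu}{2,}\z/;

for my $s ("ab", "AB", "!!") {
    printf "%-3s  \\U\\w: %-3s  \\p{Lu}: %s\n",
        $s,
        ($s =~ $swapped ? "yes" : "no"),
        ($s =~ $upper   ? "yes" : "no");
}
```

With these inputs only `!!` satisfies the `\U\w` version, while only `AB` satisfies the `\p{Lu}` version.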
Re^2: regular expressions in unicode by Anonymous Monk on Jan 25, 2007 at 15:22 UTC
Sorry for the typos. ... I see that, but why? I thought `\U` was supposed to be an escape sequence that converts the character sequence to uppercase, so I thought `\U\w{2,}` would only match uppercase characters. `\U[A-Za-z0-9_]{2,}` does just that, and I thought `\w` was a shortcut for that character class (simply put, leaving aside locale settings, Unicode and so on)?

Thanks
Re^3: regular expressions in unicode by dave_the_m (Monsignor) on Jan 25, 2007 at 22:08 UTC
"I thought \U was supposed to be an escape sequence to convert the character sequence to uppercase."

It is, but it applies to the pattern, not to the string being matched. It's most useful when the pattern contains an interpolated string, e.g.

    $ perl -le '$s = "a"; print "\U$s"'
    A
    $ perl -le '$s = "a"; print "matched a" if "a" =~ "\U$s"'
    $ perl -le '$s = "a"; print "matched A" if "A" =~ "\U$s"'
    matched A

Dave.
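The same point explains why the character-class version from the follow-up question appeared to work. In the sketch below, `\U` uppercases the text of the class, turning the `a-z` range into a second `A-Z`. The variable names and the search term `perl` are hypothetical, chosen only for the demonstration.

```perl
use strict;
use warnings;

# \U rewrites the pattern text: [A-Za-z0-9_] becomes [A-ZA-Z0-9_],
# which no longer contains any lowercase letters.
my $class = qr/\A\U[A-Za-z0-9_]{2,}\E\z/;

# With an interpolated variable, \U uppercases the variable's contents.
my $word  = "perl";                  # hypothetical search term
my $shout = qr/\A\U$word\E\z/;       # behaves like \APERL\z

for my $s ("AB", "ab", "PERL", "perl") {
    printf "%-4s  class: %-3s  shout: %s\n",
        $s,
        ($s =~ $class ? "yes" : "no"),
        ($s =~ $shout ? "yes" : "no");
}
```

In both cases the string being matched is left untouched; only the pattern changes.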