regexp on utf8 string

fredvdv has asked for the wisdom of the Perl Monks concerning the following question:

I want to make the first letter of each word uppercase and the rest of the word lowercase, so I use this regexp:

$s = "word1 word2 word3";
$s =~ s/\b([A-Za-z]+)\b/\u\L$1/g

Everything goes well until $s contains UTF-8 characters; then the characters following a UTF-8 one are also uppercased. Is there a way to make \b aware of UTF-8 characters? Is there another regexp that does the same processing without messing up strings containing UTF-8? Regards, Frederic.


Replies are listed 'Best First'.
Re: regexp on utf8 string by tradez (Pilgrim) on Sep 10, 2004 at 15:21 UTC
Have you tried just using the function ucfirst, as in `$newWord = ucfirst($oldword);`? Tradez "Never underestimate the predictability of stupidity" - Bullet Tooth Tony, Snatch (2001)
Re: regexp on utf8 string by borisz (Canon) on Sep 10, 2004 at 15:23 UTC
`s/\b(\p{Ll}\p{L}*)\b/\u\L$1/g;` Boris
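A possible reading of the original symptom: if $s still holds raw UTF-8 bytes, the bytes of a multi-byte character are not word characters, so \b matches right after them and the ASCII letters that follow look like the start of a new word. The sketch below assumes that is the cause, decodes the bytes first, and then uses a Unicode letter class along the lines of the one above. The helper name `capitalise_words` and the sample text are made up for illustration, and the pattern is simplified to `\p{L}+` so that words already starting with a capital are normalised too.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the thread): capitalise each run of letters.
# Expects a decoded character string, not raw UTF-8 bytes.
sub capitalise_words {
    my ($text) = @_;
    # \L lowercases the whole word, \u then uppercases its first letter.
    $text =~ s/(\p{L}+)/\u\L$1/g;
    return $text;
}

# Raw UTF-8 bytes, as they might arrive from a file or a socket.
# Decoded, this reads: e-acute t e-acute, space, a-grave, space, PARIS.
my $bytes = "\xC3\xA9t\xC3\xA9 \xC3\xA0 PARIS";

my $chars  = decode('UTF-8', $bytes);     # bytes -> characters
my $result = capitalise_words($chars);

binmode STDOUT, ':encoding(UTF-8)';       # encode again on output
print "$result\n";
```

Decoding at the input boundary and encoding at the output boundary keeps the regexp itself free of any byte-level concerns.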
Re: regexp on utf8 string by davido (Cardinal) on Sep 11, 2004 at 05:02 UTC
The good thing about functions such as ucfirst is that they're locale-friendly. You don't really have to worry about what upper-case characters map to what lower-case characters in a given character set; ucfirst knows. Try this: `my $string = "word1 word2 word3"; $string = join ' ', map { ucfirst } split /\s+/, $string;` It ought to do the trick. ...just one way to do it. A slightly more robust regexp solution might include the following: `use strict; use warnings; my $string = "word1 word2 word3"; $string =~ s/(\w+)(?=\W|$)/ucfirst $1/eg;` Dave
Re^2: regexp on utf8 string by fredvdv (Novice) on Sep 12, 2004 at 16:31 UTC
I finally managed to do the trick with these two steps. First, uppercase the first letter and lowercase the rest: `$string = "\u\L$string";` My string always contains spaces and/or dashes, so I then uppercase each letter that follows one or more spaces or dashes: `$string =~ s/([\s-]+)(.)/$1\u$2/g;` Regards, Frederic
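For reference, the two statements above can be put together roughly as follows. This is only a sketch: the wrapper name `capitalise_after_separators` and the sample name are invented here, and it assumes the string has already been decoded from UTF-8 into characters so that `\u` and `\L` see whole letters.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical wrapper around the two statements from the post above.
sub capitalise_after_separators {
    my ($string) = @_;
    # Step 1: lowercase everything, then uppercase the very first character.
    $string = "\u\L$string";
    # Step 2: uppercase the character that follows each run of spaces or dashes.
    $string =~ s/([\s-]+)(.)/$1\u$2/g;
    return $string;
}

# Sample input as UTF-8 bytes; \xC3\xA9 is e-acute.
my $name = decode('UTF-8', "jean-\xC3\xA9RIC de la fontaine");

binmode STDOUT, ':encoding(UTF-8)';
print capitalise_after_separators($name), "\n";
```

Because the second step keys on separators rather than on \b, it does not depend on which characters count as word characters.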
Re: regexp on utf8 string by trammell (Priest) on Sep 10, 2004 at 15:26 UTC
Perhaps you could work `ucfirst` into your solution?