Character class for French chars with accents in regex?

punch_card_don has asked for the wisdom of the Perl Monks concerning the following question:

Re: Character class for French chars with accents in regex? by ikegami (Patriarch) on Aug 09, 2007 at 17:22 UTC
One way that won't break depending on the encoding of your source (.pl) file is:

    use HTML::Entities qw( decode_entities );

    # It is technically possible for Uuml to
    # be encountered in French, but I don't
    # know of any words that use it.
    my @french_accents =
       map decode_entities("&$_;"),
       map +($_, lc),
       qw( Acirc Agrave Eacute Ecirc Egrave Euml Icirc Iuml
           Ocirc Ugrave Uuml Ccedil AElig OElig );

    my $french_accents = join '', @french_accents;

    # The hyphen is escaped so that it is a literal "-" and not
    # the start of a range ('-\. would span every character from ' to .).
    if ($form_values{$field} =~ /[^A-Za-z0-9_'.\s\-$french_accents]/) {
        # The value contains a character outside the allowed set.
    }

Tested. Note: Don't forget to decode the value placed in `$form_values{$field}`.

Update: Added `Ocirc`. Should `Ucirc` be on that list? It's been so long since I've written in French.
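The note about decoding matters because the character class holds characters, while a raw form value holds bytes. Here is a minimal sketch of the difference, assuming the form value arrives as UTF-8 bytes. The variable names and the sample value are hypothetical, and the accent list is written as code points (the same letters as the entity list above) so that the sketch needs only core modules.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Assumption: the same accented letters as the entity list, as code points.
my $french_accents = join '', map { chr } (
    0xC0, 0xC2, 0xC6, 0xC7, 0xC8, 0xC9, 0xCA, 0xCB, 0xCE, 0xCF, 0xD4, 0xD9, 0xDC, 0x152,
    0xE0, 0xE2, 0xE6, 0xE7, 0xE8, 0xE9, 0xEA, 0xEB, 0xEE, 0xEF, 0xF4, 0xF9, 0xFC, 0x153,
);

# Matches any single character outside the allowed set.
my $invalid = qr/[^A-Za-z0-9_'.\s\-$french_accents]/;

# Hypothetical raw form value: the UTF-8 bytes for "Hélène".
my $raw_bytes = "H\xC3\xA9l\xC3\xA8ne";

# Decode the bytes into characters before matching.
my $name = decode('UTF-8', $raw_bytes);

# True when no disallowed character is found.
my $decoded_ok = $name      !~ $invalid;
my $raw_ok     = $raw_bytes !~ $invalid;

print "decoded value: ", ($decoded_ok ? "valid" : "invalid"), "\n";
print "raw bytes:     ", ($raw_ok     ? "valid" : "invalid"), "\n";
```

Without the decode step, each accented letter is seen as two separate bytes, neither of which is in the class, so a legitimate name is rejected.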
Re: Character class for French chars with accents in regex? by clinton (Priest) on Aug 09, 2007 at 18:27 UTC
If what you actually want to check is that it is a word character, rather than specifically a letter of the French alphabet, then you can decode your text from UTF-8 into Perl characters and use `\w` to match. From perlunicode:

> Character classes in regular expressions match characters instead of bytes and match against the character properties specified in the Unicode properties database. \w can be used to match a Japanese ideograph, for instance. (However, and as a limitation of the current implementation, using \w or \W inside a [...] character class will still match with byte semantics.)

This helps when a user's name contains, e.g., Ñ: it is still allowed even though it is not French.

Clint
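A minimal sketch of the `\w` approach, assuming the input arrives as UTF-8 bytes. The sample values are hypothetical. Once the text is decoded, `\w` is used on its own (outside a bracketed class) and matches accented letters from any language.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical inputs as UTF-8 bytes: "Nuñez", "François", and "a;b".
my @raw_values = ("Nu\xC3\xB1ez", "Fran\xC3\xA7ois", "a;b");

my @accepted;
for my $bytes (@raw_values) {
    # Decode so the regex sees characters, not bytes.
    my $text = decode('UTF-8', $bytes);

    # \w on decoded text matches any Unicode word character.
    push @accepted, $text if $text =~ /\A\w+\z/;
}

binmode STDOUT, ':encoding(UTF-8)';
print "accepted: $_\n" for @accepted;
```

Both names are accepted and the string containing `;` is rejected.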
Re^2: Character class for French chars with accents in regex? by ikegami (Patriarch) on Aug 09, 2007 at 18:59 UTC
There are at least two downsides to that method worth mentioning.

First, it allows similar-looking characters to be used. For example, there's a Cyrillic letter that looks almost identical to the Latin 'a'. If the regexp is used to limit valid user names, it wouldn't stop one user from impersonating another by creating a similar-looking user name.

Secondly, it may allow characters that users have no easy way of entering into forms, and characters that some or many users are unable to render.

The severity of these downsides depends on the purpose of the regexp.

Update: Here are some similar-looking strings, but each is different:

French Braid
Frenсh Braid
French Вraid
French Brаid
French Braіd
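One possible mitigation for the look-alike problem, offered as a sketch and not as something proposed in this thread, is to flag strings that mix Latin and Cyrillic letters. It uses Perl's Unicode script properties `\p{Latin}` and `\p{Cyrillic}`. The helper name `mixes_scripts` is hypothetical, and the check covers only these two scripts.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the text contains both Latin and Cyrillic letters.
sub mixes_scripts {
    my ($text) = @_;
    return ($text =~ /\p{Latin}/ && $text =~ /\p{Cyrillic}/) ? 1 : 0;
}

my $plain   = "French Braid";
my $spoofed = "Fren\x{0441}h Braid";   # U+0441 CYRILLIC SMALL LETTER ES replaces "c"

for my $candidate ($plain, $spoofed) {
    printf "%d characters, mixed scripts: %s\n",
        length $candidate,
        mixes_scripts($candidate) ? "yes" : "no";
}
```

The two strings have the same length and look alike on screen, but only the second one is flagged.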
Re^3: Character class for French chars with accents in regex? by clinton (Priest) on Aug 09, 2007 at 19:03 UTC
Fair points, both, and well mentioned. Depending on the application for this filter, these downsides may count for less than making your customers irate because they can't enter their names.

Clint
Re^4: Character class for French chars with accents in regex? by Anonymous Monk on Aug 10, 2007 at 11:44 UTC
Re: Character class for French chars with accents in regex? by jhourcle (Prior) on Aug 09, 2007 at 17:34 UTC
I've never had to do it, but from perlre, it seems like it's just a matter of using the proper locale:

> If the "utf8" pragma is not used but the "locale" pragma is, the classes correlate with the usual isalpha(3) interface (except for `word' and `blank').

and from perllocale:

> Here is a code snippet to tell what "word" characters are in the current locale, in that locale's order:

    use locale;
    print +(sort grep /\w/, map { chr } 0..255), "\n";

> Compare this with the characters that you see and their order if you state explicitly that the locale should be ignored:

    no locale;
    print +(sort grep /\w/, map { chr } 0..255), "\n";
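The perllocale snippets report on whatever locale is already in effect. As a hedged sketch of selecting a French locale first, the following uses `setlocale` from the core POSIX module. The locale name `fr_FR.ISO8859-1` is an assumption: names vary by system, and the call returns undef if that locale is not installed, in which case the current locale stays in effect.

```perl
use strict;
use warnings;
use POSIX qw( setlocale LC_CTYPE );

# Assumed locale name; check `locale -a` for the names your system provides.
my $wanted = 'fr_FR.ISO8859-1';
my $got    = setlocale(LC_CTYPE, $wanted);

my @word_chars;
{
    # Within this block, \w follows the locale's character classification.
    use locale;
    @word_chars = grep /\w/, map { chr } 0 .. 255;
}

printf "%s: %d word characters among code points 0..255\n",
    defined $got ? $got : "locale $wanted unavailable, using the current one",
    scalar @word_chars;
```

If the French locale is available, the count should include the accented Latin-1 letters in addition to the ASCII word characters.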