punch_card_don has asked for the wisdom of the Perl Monks concerning the following question:

Massive Monks,

I'm checking form input with a positive-option check (that is, only a specified list of characters is allowed; if anything not in the list is present, the input is disallowed), like this:

if ($form_values{$field} =~ /[^A-Za-z0-9_'\-.\s]/) {
    # disallow
} else {
    # do stuff
}

How can I define a character class that includes all possible combinations of letters and French accents?

Thanks.




Forget that fear of gravity,
Get a little savagery in your life.

Re: Character class for French chars with accents in regex?
by ikegami (Patriarch) on Aug 09, 2007 at 17:22 UTC

    One way that won't break depending on the encoding of your source (.pl) file is:

    use HTML::Entities qw( decode_entities );

    # It is technically possible for Uuml to
    # be encountered in French, but I don't
    # know of any words that use it.
    my @french_accents =
        map decode_entities("&$_;"),
        map +($_, lc),
        qw(
            Acirc Agrave Eacute Ecirc Egrave Euml Icirc Iuml
            Ocirc Ugrave Uuml Ccedil AElig OElig
        );

    my $french_accents = join '', @french_accents;

    # The hyphen is escaped so that '-. is not read as a range.
    if ($form_values{$field} =~ /[^A-Za-z0-9_'\-.\s$french_accents]/) {
        # disallow
    }

    Tested.

    Note: Don't forget to decode the value placed in $form_values{$field}.

    Update: Added Ocirc. Should Ucirc be on that list? It's been so long since I've written in French.
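
    The note about decoding can be sketched as follows. This is only a minimal illustration under some assumptions: the form value is assumed to arrive as UTF-8 bytes, the sub name is_acceptable is made up for the example, and the accented letters are written as code points instead of going through HTML::Entities so that the snippet depends only on core modules.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Same letters as the entity list above, written as code points
# (upper case first, then lower case). U+0152/U+0153 are the OE ligatures.
my $french_accents = join '', map { chr } (
    0xC0, 0xC2, 0xC6, 0xC7, 0xC8, 0xC9, 0xCA, 0xCB, 0xCE, 0xCF, 0xD4, 0xD9, 0xDC, 0x152,
    0xE0, 0xE2, 0xE6, 0xE7, 0xE8, 0xE9, 0xEA, 0xEB, 0xEE, 0xEF, 0xF4, 0xF9, 0xFC, 0x153,
);

# Hypothetical helper: takes the raw bytes of a form value (assumed UTF-8)
# and returns 1 if every character is in the allowed list, 0 otherwise.
sub is_acceptable {
    my ($octets) = @_;
    my $text = decode('UTF-8', $octets);    # bytes -> characters
    return $text =~ /[^A-Za-z0-9_'\-.\s$french_accents]/ ? 0 : 1;
}
```

    Without the decode step, each accented letter would reach the regex as two separate bytes and would never match the single characters in the class.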

Re: Character class for French chars with accents in regex?
by clinton (Priest) on Aug 09, 2007 at 18:27 UTC
    If what you actually want to check is that each character is a word character, rather than specifically a letter of the French alphabet, then you can convert your text to UTF-8 and use \w to match.

    From perlunicode:

    Character classes in regular expressions match characters instead of bytes and match against the character properties specified in the Unicode properties database. \w can be used to match a Japanese ideograph, for instance.

    (However, and as a limitation of the current implementation, using \w or \W inside a ... character class will still match with byte semantics.)

    This helps when a user's name contains (eg) Ñ - it is still allowed even though it is not French.

    Clint
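
    A rough sketch of the \w approach, under assumptions that are not in the post above: the input is taken to be UTF-8 bytes, the sub name has_only_word_chars is invented for the example, and the /u modifier (Perl 5.14 or later) is used to ask for Unicode rules explicitly, which sidesteps the byte-semantics caveat quoted from perlunicode.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: returns 1 if the value contains only word characters
# (letters, digits, underscore in any script), apostrophes, hyphens, dots
# and whitespace; returns 0 otherwise.
sub has_only_word_chars {
    my ($octets) = @_;
    my $text = decode('UTF-8', $octets);    # bytes -> characters
    # /u makes \w and \s follow Unicode rules inside the bracketed class
    return $text =~ /[^\w'\-.\s]/u ? 0 : 1;
}
```

    Since \w already covers A-Z, a-z, 0-9 and the underscore, the class here is the original one with the explicit letter ranges replaced by \w.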

      There are at least two downsides to that method worth mentioning.

      First, it allows similar-looking characters to be used. For example, there's a Cyrillic letter that looks almost identical to the Latin 'a'. If the regexp is used to limit valid user names, it wouldn't stop one user from impersonating another by creating a similar-looking user name.

      Secondly, it may allow characters that users have no easy way of entering into forms and characters that some/many users are unable to render.

      The severity of these downsides depends on the purpose of the regexp.

      Update: Here are some similar looking strings, but each is different:

      • French Braid
      • Frenсh Braid
      • French Вraid
      • French Brаid
      • French Braіd

        Fair points, both, and well mentioned. Depending on the application for this filter, these downsides may count for less than making your customers irate because they can't enter their names.

        Clint
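
        One possible way to catch the look-alike strings listed above is to reject values that mix scripts. The following is a sketch only, assuming the value has already been decoded to characters; the sub name mixes_latin_and_cyrillic is hypothetical, and it covers just the Latin/Cyrillic pair from the example, not look-alikes in general.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the string contains letters from both
# the Latin and the Cyrillic scripts, 0 otherwise.
sub mixes_latin_and_cyrillic {
    my ($text) = @_;
    return ($text =~ /\p{Latin}/ && $text =~ /\p{Cyrillic}/) ? 1 : 0;
}

my @names = (
    "French Braid",          # all Latin
    "Fren\x{441}h Braid",    # U+0441 CYRILLIC SMALL LETTER ES in place of "c"
    "French Bra\x{456}d",    # U+0456, a Cyrillic letter, in place of "i"
);

# Print a verdict per entry (by index, to avoid printing wide characters).
for my $i (0 .. $#names) {
    printf "name %d: %s\n", $i,
        mixes_latin_and_cyrillic($names[$i]) ? "mixed scripts" : "single script";
}
```

        A name written entirely in Cyrillic would still pass this check, so it only addresses the impersonation case, not the rendering or input concerns.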

Re: Character class for French chars with accents in regex?
by jhourcle (Prior) on Aug 09, 2007 at 17:34 UTC

    I've never had to do it, but from perlre, it seems like it's just a matter of using the proper locale:

    If the "utf8" pragma is not used but the "locale" pragma is, the classes correlate with the usual isalpha(3) interface (except for `word' and `blank').

    and from perllocale:

    Here is a code snippet to tell what "word" characters are in the current locale, in that locale's order:

    use locale;
    print +(sort grep /\w/, map { chr } 0..255), "\n";

    Compare this with the characters that you see and their order if you state explicitly that the locale should be ignored:

    no locale;
    print +(sort grep /\w/, map { chr } 0..255), "\n";
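
    To try this with a specific locale rather than whatever the environment provides, something along these lines might work. It is a sketch with assumptions: the sub name word_chars_for_locale is made up, and locale names such as "fr_FR.ISO8859-1" vary between systems and may not be installed, so the sub returns an empty list when the locale cannot be set.

```perl
use strict;
use warnings;
use POSIX qw( setlocale LC_CTYPE );

# Hypothetical helper: returns the single-byte characters that \w matches
# under the named locale, or an empty list if that locale is unavailable.
sub word_chars_for_locale {
    my ($locale_name) = @_;
    my $previous = setlocale(LC_CTYPE);              # remember the current setting
    my $ok       = setlocale(LC_CTYPE, $locale_name);
    return unless $ok;                               # locale not installed

    use locale;                                      # \w now follows LC_CTYPE
    my @chars = grep { /\w/ } map { chr } 0 .. 255;

    setlocale(LC_CTYPE, $previous) if defined $previous;   # restore
    return @chars;
}

# The locale name below is an assumption; check `locale -a` on your system.
my @french = word_chars_for_locale("fr_FR.ISO8859-1");
print scalar(@french), " word characters found\n";
```

    If the count printed is 0, the named locale is not available on that machine and the form check would need one of the other approaches in this thread.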