holli has asked for the wisdom of the Perl Monks concerning the following question:

And so it came to pass that holli was given the task of developing a webapp using Catalyst and the Template Toolkit. "See!", the powers that must be obeyed commanded, "in this form the users shall input their names, so that they can be rewarded with their emails being personalized. But!", and the skies rumbled, "of course the name field will have to be checked via a regex to ensure there is no bad input!"

Some time later, holli had finished the controller for the form and started testing. To his astonishment he found that the regex he used to validate the user names failed. More precisely, it does not match German umlauts, regardless of whether they are matched against a word character (\w) or an explicit character class [äÄöÖüÜ]. Now, he knows about different encodings, but all files in his project are encoded in UTF-8 (templates, HTML, code, everything).

Now, who can tell that poor sod how to proceed to make the regex match?


P.S.
The actual regex used is qr/^\w[\w\s\-]+$/.
The umlauts seem to be correctly encoded in the request (nachname=M%C3%BCller => nachname=Müller)


holli, /regexed monk/

Replies are listed 'Best First'.
Re: WebApps and Encoding
by Corion (Patriarch) on May 03, 2007 at 16:56 UTC

    I'm not sure that your incoming strings will be UTF-8, or rather, the incoming strings might contain UTF-8 bytes without being tagged with the utf8 flag in Perl. Which encoding browsers use for form submissions is largely guesswork; a rule of thumb is that the browser uses the same encoding the page was sent with. So inspect the bytes you get back and maybe manually switch on the UTF-8 flag on your $nachname variable; then the (utf8) regex should/could/might actually match. You could also enlist the help of demerphq :)
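To make "inspect the bytes" concrete, here is a minimal sketch. It assumes the raw form value arrives as the UTF-8 bytes of "Müller"; the hard-coded $nachname is a hypothetical stand-in for whatever the request actually delivers. It dumps the bytes, reports the utf8 flag, and compares the regex result before and after decoding with Encode.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for the raw, URL-unescaped form value:
# the UTF-8 bytes of "Müller" (M, 0xC3 0xBC, l, l, e, r).
my $nachname = "M\xC3\xBCller";

# Dump the bytes as hex to see what actually arrived.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $nachname;
print "bytes:     $hex\n";
print "utf8 flag: ", ( utf8::is_utf8($nachname) ? 'on' : 'off' ), "\n";

my $pattern = qr/^\w[\w\s\-]+$/;

# Matching the undecoded bytes: 0xC3 and 0xBC are not word characters here.
my $raw_ok = $nachname =~ $pattern ? 1 : 0;

# Decoding turns the two bytes 0xC3 0xBC into the single character U+00FC.
my $decoded    = decode( 'UTF-8', $nachname );
my $decoded_ok = $decoded =~ $pattern ? 1 : 0;

print "raw:     length ", length($nachname), ", match $raw_ok\n";
print "decoded: length ", length($decoded),  ", match $decoded_ok\n";
```

If the hex dump shows a single FC instead of C3 BC, the browser sent Latin-1 rather than UTF-8, and the decode call would need a different source encoding.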

    If that approach of tagging (or encoding) the data fails, maybe turn the regex around and reject what is not allowed:

    my @rules = (
        # presuming ASCII machine
        # this will likely break on an EBCDIC machine
        [ qr/[\x00-\x1f]/, 'Keine Steuerzeichen erlaubt' ],   # no control characters
        [ qr/\d/,          'Keine Zahlen erlaubt' ],          # no digits
        [ qr/[!-@]/,       'Keine Sonderzeichen erlaubt' ],   # no special characters
        [ qr/[\[-`~]/,     'Keine Sonderzeichen erlaubt' ],
        # ...
    );

    for (@rules) {
        my ($rule, $message) = @$_;
        if ($nachname =~ /$rule/) {
            die $message;
        }
    }
Re: WebApps and Encoding
by clinton (Priest) on May 03, 2007 at 17:38 UTC
    Corion is right: the string you're matching against doesn't have the UTF8 flag turned on (assuming it is UTF-8, which it is in your example).

    I run all my POST/GET params through Encode::decode('utf8',$string) before doing any pattern matching.

    Not sure what library you're using for parsing the POST/GET data, but libapreq (Apache2::Request etc) tries to figure out in which character set the data is encoded. I use that as a basis for deciding what charset to put into my decode statement.
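A minimal sketch of the decode-before-matching step described above, not tied to any particular request library. The helper names decode_param and is_valid_name are hypothetical, and the sketch assumes the browser really sent UTF-8. It uses the strict 'UTF-8' encoding name together with Encode::FB_CROAK, so malformed input is rejected instead of being silently repaired.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode a raw parameter as strict UTF-8.
# Returns undef when the bytes are not valid UTF-8.
sub decode_param {
    my ($bytes) = @_;
    return eval {
        Encode::decode( 'UTF-8', $bytes,
            Encode::FB_CROAK | Encode::LEAVE_SRC );
    };
}

# Hypothetical helper: decode first, then apply the name regex
# from the original question to the decoded characters.
sub is_valid_name {
    my ($bytes) = @_;
    my $name = decode_param($bytes);
    return 0 unless defined $name;
    return $name =~ /^\w[\w\s\-]+$/ ? 1 : 0;
}

print is_valid_name("M\xC3\xBCller"), "\n";   # UTF-8 bytes for "Müller"
print is_valid_name("M\xFCller"),     "\n";   # Latin-1 bytes, not valid UTF-8
```

If the charset is known to be something else (for example from the request headers), the same structure works with a different encoding name passed to Encode::decode.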

Re: WebApps and Encoding
by shmem (Chancellor) on May 03, 2007 at 17:32 UTC
    Quick fix -

    converting your input to iso-8859-1 (a.k.a latin-1) via Encode will help. Here's a minimal CGI:

    #!/usr/bin/perl
    use CGI;
    use Encode qw(from_to);
    use strict;

    my $q = CGI->new;
    my ( $name, $result );

    if ( $name = $q->param('name') ) {
        my $tmp = $name;
        # convert the UTF-8 bytes to Latin-1 in place
        from_to( $tmp, "utf-8", "iso-8859-1" );
        # the literal umlauts match only if this file itself is saved as ISO-8859-1
        $result = $tmp =~ /^\w[\w\s\-äÄöÖüÜ]+$/ ? 'ok' : 'failed';
    }

    print $q->header( -charset => 'utf-8' );
    print <<EOH;
    <html>
    <head>
    <title>Foo</title>
    </head>
    <body>
    <form name="foo" method="post">
    Name: <input name="name" />
    <input type="submit" value="send" />
    </form>Name: $name, parsing: $result
    </body>
    </html>
    EOH

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: WebApps and Encoding
by kyle (Abbot) on May 03, 2007 at 17:47 UTC

    As a side comment, you might want to account for Irish names that have an apostrophe in them. Also, you might be matching some things that are not names.

    use Test::More;

    my $holli = qr/^\w[\w\s\-]+$/;
    my $kyle  = qr{ \A \w [\w\s'-]+ \z }xms;

    my @good_names = ( 'Smith', q{O'Donnelly}, 'Smith-Jones' );
    my @bad_names  = ( '12', "\\Windows\\System", '' );

    plan 'tests' => (2 * scalar @good_names) + (2 * scalar @bad_names);

    foreach my $name ( @good_names ) {
        ok( $name =~ $holli, "holli pattern matches [$name]" );
        ok( $name =~ $kyle,  "kyle pattern matches [$name]" );
    }

    foreach my $name ( @bad_names ) {
        ok( $name !~ $holli, "holli pattern does not match [$name]" );
        ok( $name !~ $kyle,  "kyle pattern does not match [$name]" );
    }

    __END__
    1..12
    ok 1 - holli pattern matches [Smith]
    ok 2 - kyle pattern matches [Smith]
    not ok 3 - holli pattern matches [O'Donnelly]
    #   Failed test 'holli pattern matches [O'Donnelly]'
    #   in perlmonks.pl at line 18.
    ok 4 - kyle pattern matches [O'Donnelly]
    ok 5 - holli pattern matches [Smith-Jones]
    ok 6 - kyle pattern matches [Smith-Jones]
    not ok 7 - holli pattern does not match [12]
    #   Failed test 'holli pattern does not match [12]'
    #   in perlmonks.pl at line 22.
    not ok 8 - kyle pattern does not match [12]
    #   Failed test 'kyle pattern does not match [12]'
    #   in perlmonks.pl at line 23.
    ok 9 - holli pattern does not match [\Windows\System]
    ok 10 - kyle pattern does not match [\Windows\System]
    ok 11 - holli pattern does not match []
    ok 12 - kyle pattern does not match []
    # Looks like you failed 3 tests of 12.

    (OK, maybe Test::More was overkill to make my point.)

Re: WebApps and Encoding
by thundergnat (Deacon) on May 03, 2007 at 18:06 UTC

    Are you sure that your names will be constrained to Latin-1? If not, perhaps you should allow for Unicode from the start.

    Also, I prefer to check that there AREN'T any characters that AREN'T allowed rather than checking that EACH character IS.

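    # decode the names so that \p{Alpha} sees characters, not UTF-8 bytes
    binmode DATA,   ':encoding(UTF-8)';
    binmode STDOUT, ':encoding(UTF-8)';

    while ( my $name = <DATA> ) {
        chomp $name;
        print "$name - " . ( $name =~ /[^\p{Alpha} '-]/ ? "FAIL\n" : "PASS\n" );
    }

    __DATA__
    Müller
    D'Augustine
    De Vries
    Badin-Powell
    1 of the above
    1337 |-|@(|<3|?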