theirpuppet has asked for the wisdom of the Perl Monks concerning the following question:

i'm trying to add the lamest amount of internationalization to some cgi stuff i'm doing. trying to allow some extended ascii chars through a regex that basically looks like this:
$line =~ s/[^0-9a-z\x80-\xA5_:,!\?\.\* -]//ig;
notice i'm using hexadecimal escaping for ascii chars (decimal) 128-165. it doesn't work. trying to pass any accented char, and it just rips it away. any advice is appreciated, including how to optimize my longass regexes (i've got longer than this, but i'm trying to take the approach of default deny - a firewall methodology).

Re: extended ascii regex
by BrowserUk (Patriarch) on Dec 11, 2002 at 06:07 UTC

    Seems to work ok for me?

    $s = do{ $£ .= chr for 0..255; $£ }; print $s;
    [prints all 256 characters, chr(0) through chr(255), as rendered by the console code page]

    $s =~ s/[^0-9a-z\x80-\xA5_:,!\?\.\* -]//ig; print $s;
     !*,-.0123456789:?ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜ¢£¥₧ƒáíóúñÑ

    That said, tr///cd is probably a better way to do this.

    # tr takes a plain character list (no brackets) and has no /i, so A-Z is listed explicitly
    $s =~ tr/0-9a-zA-Z\x80-\xA5_:,!?.* -//cd;
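    A minimal sketch comparing the two approaches on every byte value. It assumes plain byte strings (no locale, no UTF-8 flag); the variable names are made up for illustration. Note that tr/// takes a bare character list rather than a bracketed class, and it has no case-insensitive flag.

```perl
use strict;
use warnings;

# Build a string containing every byte value from 0 to 255.
my $all = join '', map { chr } 0 .. 255;

# Whitelist via substitution: delete everything NOT in the class.
(my $via_s = $all) =~ s/[^0-9a-z\x80-\xA5_:,!?.* -]//ig;

# Same whitelist via tr///cd. The list is written without brackets,
# and A-Z is spelled out because tr has no /i flag.
(my $via_tr = $all) =~ tr/0-9a-zA-Z\x80-\xA5_:,!?.* -//cd;

print "both approaches agree\n" if $via_s eq $via_tr;
printf "%d characters kept\n", length $via_s;
```

    With brackets left inside the tr list, "[" and "]" would themselves count as allowed characters.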


    Okay you lot, get your wings on the left, halos on the right. It's one size fits all, and "No!", you can't have a different color.
    Pick up your cloud down the end and "Yes" if you get allocated a grey one they are a bit damp under foot, but someone has to get them.
    Get used to the wings fast cos it's an 8 hour day...unless the Governor calls for a cyclone or hurricane, in which case 16 hour shifts are mandatory.
    Just be grateful that you arrived just as the tornado season finished. Them buggers are real work.

Re: extended ascii regex
by seattlejohn (Deacon) on Dec 11, 2002 at 05:54 UTC
    Works fine for me, e.g.:
    use strict;
    my @allow = ("\xA1","\x89","\x94");
    my @disallow = ("\xB2","\xFF");
    foreach my $char (@allow) {
        $char =~ s/[^0-9a-z\x80-\xA5_:,!\?\.\* -]//ig;
        print "wrong: char " . ord($char) . " should be allowed\n" if $char eq '';
    }
    foreach my $char (@disallow) {
        $char =~ s/[^0-9a-z\x80-\xA5_:,!\?\.\* -]//ig;
        print "wrong: char " . ord($char) . " should not be allowed\n" if $char ne '';
    }
    exit;

    (Tested on perl 5.6.1, ActiveState build 631, Windows XP.) Perhaps you could post some examples of specific input items that generate unexpected results?
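    One way to produce such examples is to dump the raw bytes of the input before filtering. The sketch below is only an illustration: hex_dump is a hypothetical helper, and the sample string assumes the input arrives as ISO-8859-1, where e-acute is byte 0xE9.

```perl
use strict;
use warnings;

# Hypothetical helper: render each byte of a string as two hex digits.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# Assumed sample input: "cafe" with an ISO-8859-1 e-acute (0xE9),
# which lies outside the allowed \x80-\xA5 range.
my $line = "caf\xE9";
print hex_dump($line), "\n";
```

    If the dump shows values above 0xA5 for the accented characters, the regex is doing exactly what it was told.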

            $perlmonks{seattlejohn} = 'John Clyman';

Re: extended ascii regex
by submersible_toaster (Chaplain) on Dec 11, 2002 at 06:17 UTC

    Uh-huh... "it just rips it away" is not really surprising, since that is what you're asking this regex to do!

    To paraphrase: wherever a character is found that is not a digit, not a letter (the /i makes a-z match either case), not in the range 0x80 to 0xA5, and not one of _ : , ! ? . * (space) or -, replace it with nothing.

    Let's assume that you wish to accept alphanumerics, a selection of punctuation chars, and anything between \x80 and \xA5.
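    That intent can be sketched with the class assembled from named pieces and compiled once with qr//, which may also make the longer regexes easier to maintain. The names $alnum, $high, $punct, $deny and sanitize are made up for this example.

```perl
use strict;
use warnings;

# Pieces of the whitelist, kept as plain strings for readability.
my $alnum = '0-9a-zA-Z';
my $high  = '\x80-\xA5';      # the regex engine interprets the \x escapes
my $punct = '_:,!?.* -';      # '-' comes last so it is a literal hyphen

# Compile the "anything NOT on the whitelist" class once.
my $deny = qr/[^${alnum}${high}${punct}]/;

# Hypothetical helper: return a copy of the input with denied chars removed.
sub sanitize {
    my ($s) = @_;
    $s =~ s/$deny//g;
    return $s;
}

print sanitize("a<b>!"), "\n";
```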

    I am as puzzled as you.


    Update: I'll keep quiet now. See above.
Re: extended ascii regex
by theirpuppet (Sexton) on Dec 11, 2002 at 19:36 UTC
    the regex is designed, and intended, to rip out all chars that are not explicitly listed as cool (hence the ^).

    the hexadecimal range seems to be ignored. i tested with multiple character sequences, using the exact regex in CGI and console-based scripts, with both user input and statically defined variables. the result is always the same: the extended ascii chars are removed.

    is there a reason why the hexadecimal range may not be recognized by the perl interpreter? note that there are no warnings under -w. also note that i've even isolated the hexadecimal range in an if ($line =~ /\x80-\xA5/) {print "$line\n"} but nothing happens...
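    Two things may be worth checking here, sketched below under stated assumptions. First, outside square brackets a hyphen is not a range operator, so /\x80-\xA5/ looks for the literal three-character sequence \x80, "-", \xA5. Second, the bytes for an accented letter depend on the encoding: 0x80-0xA5 holds accented letters in the DOS code page CP437, while ISO-8859-1 places them at 0xC0-0xFF and UTF-8 uses two bytes. Which encoding the CGI input actually uses is an assumption; the core Encode module (bundled since perl 5.8) is used only to illustrate.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A single byte from inside the allowed range.
my $sample = "\x82";

# Without brackets: a literal sequence. With brackets: a range.
my $no_brackets   = $sample =~ /\x80-\xA5/   ? 1 : 0;
my $with_brackets = $sample =~ /[\x80-\xA5]/ ? 1 : 0;
print "without brackets: $no_brackets, with brackets: $with_brackets\n";

# The same letter (e-acute, U+00E9) as bytes in three encodings,
# and whether the thread's filter keeps those bytes intact.
for my $enc (qw(cp437 iso-8859-1 UTF-8)) {
    my $bytes = encode($enc, "\x{E9}");
    my $hex   = join ' ', map { sprintf '%02X', ord } split //, $bytes;
    (my $kept = $bytes) =~ s/[^0-9a-z\x80-\xA5_:,!?.* -]//ig;
    printf "%-10s bytes: %-5s kept by filter: %s\n",
        $enc, $hex, length($kept) == length($bytes) ? 'yes' : 'no';
}
```

    If the input turns out to be ISO-8859-1, widening the range to cover \xC0-\xFF would be one option to try.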