theirpuppet has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to find the ranges that can be used in a regex for certain extended ascii characters.
This is a script using Perl 5.8.0 under mod_perl 1.28 and apache 1.3.28 (linux).

According to: http://www.asciitable.com/ I want the following ranges: 128-151 and 160-165.

According to: http://natali.mine.nu/test2.html, they are \x80-\x97 and \xa0-\xa5

My regex is:

$param =~ s/[^a-zA-Z0-9\.\-\=\+\!\@\#\$\%\^\&\*\?\ \x80-\x97\xa0-\xa5]//g;

Well, anything I put in one of the ranges gets ripped out. They should stay, along with a-z stuff.

Please help.

Replies are listed 'Best First'.
Re: extended ASCII regex range
by graff (Chancellor) on Oct 21, 2003 at 04:06 UTC
    So, you don't want to keep the underscore character ("_")? Or comma, colon, semi-colon, slash, backslash, tilde, single/double quotes, parens, curly or square brackets? (Just checking -- enumerations of characters like you have there can be prone to leaving things out by mistake.)
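
    To make that check concrete, here is a minimal sketch (not part of the original post) that runs the negated class from the question over the printable ASCII range and lists what it would strip. The names $strip and @dropped are made up for illustration.

```perl
use strict;
use warnings;

# The negated character class from the question: it matches every
# character that the substitution would remove.
my $strip = qr/[^a-zA-Z0-9\.\-\=\+\!\@\#\$\%\^\&\*\?\ \x80-\x97\xa0-\xa5]/;

# Walk the printable ASCII range (excluding space) and collect
# every character the class would strip.
my @dropped = grep { /$strip/ } map { chr($_) } 0x21 .. 0x7e;

print "Printable ASCII characters the class strips: @dropped\n";
```

    If any of the listed characters should survive, they have to be added to the class explicitly.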

    On the page that you cite ("www.asciitable.com"), the "extended ASCII set" listed there is actually known as "PC Code Page 437" (or "cp437"), which was developed for the original IBM PCs running MS-DOS, was inherited by virtually all IBM clones, and is therefore arguably "the most popular" (as asserted on that page).

    Perl 5.8's "Encode" module can "decode" such data into utf8, so that you can deal with it as character data, rather than as byte values; and it can then "encode" it again as cp437 for output, if you want to keep to the old character set. Note that the accented characters in utf8 will be two bytes each, and will be useless when treated by any non-utf8-capable display tool or process. (The perl-internal treatment of utf8 character data in 5.8 allows you to ignore the single-byte vs. multi-byte distinction when writing the script -- every character is just a character (matches "." in a regex, etc), no matter how many bytes are needed to express it in utf8.)
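
    As a rough sketch of that decode / filter / encode round trip (assuming the input really is cp437 bytes; the sample string and the variable names $bytes, $extended, $text and $out are invented for illustration):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: raw cp437 bytes. 0x85 lies inside the wanted
# range 128-151; 0x98 lies outside both wanted ranges.
my $bytes = "abe \x85 in range lincoln \x98 out of range";

# Turn the cp437 bytes into Perl characters.
my $text = decode('cp437', $bytes);

# Decode the wanted byte ranges (128-151 and 160-165) the same way,
# so the allowed set is expressed as characters rather than bytes.
my $extended = decode('cp437',
    join '', map { chr($_) } 0x80 .. 0x97, 0xA0 .. 0xA5);

# Strip everything that is not in the allowed set.
$text =~ s/[^a-zA-Z0-9\.\-\=\+\!\@\#\$\%\^\&\*\?\ $extended]//g;

# Encode back to cp437 bytes for output.
my $out = encode('cp437', $text);

# Show the resulting bytes in hex, independent of terminal encoding.
print join(' ', map { sprintf '%02X', ord($_) } split //, $out), "\n";
```

    Interpolating $extended into the class works here only because none of the decoded characters has a special meaning inside a character class; a different set of ranges might need quoting.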

    The perl 5.8 man pages perluniintro, perlunicode, Encode, PerlIO and PerlIO::Encoding all have useful information on this and related issues.

    If you would prefer that the data remain in cp437 encoding, and have the perl script treat it as byte values as shown in your script, you will need one or more of the following pragmas in your script (depending, perhaps, on which linux distro/version you have):

    no utf8;
    use bytes;
    You may even have to specify an IO mode when opening the input and/or output files:
    open( IN, "<:raw", "input.file" ) or die $!;
    open( OUT, ">:raw", "output.file" ) or die $!;
    # or, if you're dealing with STDIN and/or STDOUT:
    binmode STDIN, ":raw";
    binmode STDOUT, ":raw";
    This will make sure that perl doesn't try to treat the data as utf8-encoded text.
Re: extended ASCII regex range
by PodMaster (Abbot) on Oct 21, 2003 at 01:56 UTC
    I highly doubt it :) your problem would appear to be elsewhere (perhaps scope related, as that is common with mod_perl)
    use strict;
    use warnings;
    my $param = 'abe '.chr(155).chr(156).'lincoln';
    warn $param;
    $param =~ s/[^a-zA-Z0-9\.\-\=\+\!\@\#\$\%\^\&\*\?\ \x80-\x97\xa0-\xa5]//g;
    die $param;
    __END__
    abe ¢£lincoln at regexS.pl line 4.
    abe lincoln at regexS.pl line 6.
    Excuse me: with 5.6.1, use utf8 helps, so it appears to be an encoding issue.
    Excuse me again. With 5.6.1, use utf8 does not help ;) However, there appears to be no issue with the following example under 5.6.1 or 5.8.0:
    use strict;
    use warnings;
    #use utf8;
    my $param = 'abe '.chr(133).' in range lincoln '.chr(152).' out of range';
    #my $param = "abe \x85 in range lincoln \x98 out of range";
    print $param,$/;
    $param =~ s/[^a-zA-Z0-9\.\-\=\+\!\@\#\$\%\^\&\*\?\ \x80-\x97\xa0-\xa5]/X/g;
    print $param,$/;
    __END__
    E:\dev\LOOSE>perl regex.utf8.pl
    abe à in range lincoln ÿ out of range
    abe à in range lincoln X out of range
    E:\dev\LOOSE>perl -Mutf8 regex.utf8.pl
    abe à in range lincoln ÿ out of range
    abe X in range lincoln X out of range
    E:\dev\LOOSE>G:\perl\bin\perl regex.utf8.pl
    abe à in range lincoln ÿ out of range
    abe à in range lincoln X out of range
    E:\dev\LOOSE>G:\perl\bin\perl -Mutf8 regex.utf8.pl
    abe à in range lincoln ÿ out of range
    abe à in range lincoln X out of range
    E:\dev\LOOSE>
    So my guess is still that it's some kind of encoding issue (if there really is an issue at all).
