extended ASCII regex range

theirpuppet has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to find the ranges that can be used in a regex for certain extended ascii characters.
This is a script using Perl 5.8.0 under mod_perl 1.28 and apache 1.3.28 (linux).

According to: http://www.asciitable.com/ I want the following ranges: 128-151 and 160-165.

According to: http://natali.mine.nu/test2.html, they are \x80-\x97 and \xa0-\xa5

My regex is:

$param =~ s/[^a-zA-Z0-9\.\-\=\+\!\@\#\$\%\^\&\*\?\ \x80-\x97\xa0-\xa5]//g;

Well, anything I put in one of the ranges gets ripped out. They should stay, along with a-z stuff.

Please help.

Re: extended ASCII regex range
by graff (Chancellor) on Oct 21, 2003 at 04:06 UTC

On the page that you cite ("www.asciitable.com"), the "extended ASCII set" listed there is actually known as "PC Code Page 437" (or "cp437"), which was developed for the original IBM PCs running MS-DOS, was inherited by virtually all IBM clones, and is therefore arguably "the most popular" (as asserted on that page).

Perl 5.8's "Encode" module can "decode" such data into utf8, so that you can deal with it as character data, rather than as byte values; and it can then "encode" it again as cp437 for output, if you want to keep to the old character set. Note that the accented characters in utf8 will be two bytes each, and will be useless when treated by any non-utf8-capable display tool or process. (The perl-internal treatment of utf8 character data in 5.8 allows you to ignore the single-byte vs. multi-byte distinction when writing the script -- every character is just a character (matches "." in a regex, etc), no matter how many bytes are needed to express it in utf8.)
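A minimal sketch of that decode/filter/encode round trip, assuming the input really is cp437. The sample string and the variable names ($bytes, $chars, %keep, $clean) are made up for illustration, and the allowed set is built by decoding the byte ranges from the question rather than by hard-coding code points:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: raw cp437 bytes (0x85 is a-grave, 0x98 is y-diaeresis).
my $bytes = "abe \x85 in range lincoln \x98 out of range";

# Bytes -> Perl characters.
my $chars = decode('cp437', $bytes);

# Decode the byte ranges from the question (128-151 and 160-165)
# to get the set of extended characters that should be kept.
my $allowed = decode('cp437', join '', map { chr($_) } (0x80 .. 0x97, 0xA0 .. 0xA5));
my %keep    = map { $_ => 1 } split //, $allowed;

# Keep the ASCII characters from the original class plus the allowed set.
my $clean = join '',
    grep { /[a-zA-Z0-9\.\-\=\+\!\@\#\$\%\^\&\*\?\ ]/ || $keep{$_} }
    split //, $chars;

# Characters -> cp437 bytes again for output.
binmode STDOUT, ':raw';
print encode('cp437', $clean), "\n";
```

The filtering is done per character here, so it does not matter how many bytes each character would take in utf8.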

The perl 5.8 man pages perluniintro, perlunicode, Encode, PerlIO and PerlIO::Encoding all have useful information on this and related issues.

If you would prefer that the data remain in cp437 encoding, and have the perl script treat it as byte values as shown in your script, you will need one or more of the following pragmas and I/O settings in your script (depending, perhaps, on which linux distro/version you have):

no utf8;
use bytes;

open( IN, "<:raw", "input.file" ) or die $!;
open( OUT, ">:raw", "output.file" ) or die $!;

# or, if you're dealing with STDIN and/or STDOUT:

binmode STDIN, ":raw";
binmode STDOUT, ":raw";
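
Put together, the byte-oriented approach might look like the sketch below. It is only an illustration: the sample data and the in-memory filehandle (standing in for "input.file") are assumptions, and the substitution is the one from the original question.

```perl
use strict;
use warnings;
use bytes;    # byte semantics, as suggested above

# Hypothetical sample data: cp437 bytes held in a scalar instead of a file.
my $data = "abe \x85 in range\nlincoln \x98 out of range\n";

# In-memory handle opened with the :raw layer, standing in for "input.file".
open( my $in, '<:raw', \$data ) or die $!;

my @cleaned;
while ( my $line = <$in> ) {
    chomp $line;
    # Same character class as in the question: strip everything not listed.
    $line =~ s/[^a-zA-Z0-9\.\-\=\+\!\@\#\$\%\^\&\*\?\ \x80-\x97\xa0-\xa5]//g;
    push @cleaned, $line;
}
close $in;

binmode STDOUT, ':raw';
print "$_\n" for @cleaned;
```

With the data kept as bytes, byte 0x85 falls inside \x80-\x97 and should survive, while byte 0x98 falls outside both ranges and is removed.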

Re: extended ASCII regex range
by PodMaster (Abbot) on Oct 21, 2003 at 01:56 UTC

I highly doubt it :) your problem would appear to be elsewhere (perhaps scope related, as that is common with mod_perl)
use strict;
use warnings;
my $param = 'abe '.chr(155).chr(156).'lincoln';
warn $param;
$param =~ s/[^a-zA-Z0-9\.\-\=\+\!\@\#\$\%\^\&\*\?\ \x80-\x97\xa0-\xa5]//g;
die $param;
__END__
abe ¢£lincoln at regexS.pl line 4.
abe lincoln at regexS.pl line 6.

~~Excuse me, with 5.6.1, use utf8 helps, so it appears to be an encoding issue~~

use strict;
use warnings;
#use utf8;

my $param = 'abe '.chr(133).' in range lincoln '.chr(152).' out of range';
#my $param = "abe \x85 in range lincoln \x98 out of range";

print $param,$/;

$param =~ s/[^a-zA-Z0-9\.\-\=\+\!\@\#\$\%\^\&\*\?\ \x80-\x97\xa0-\xa5]/X/g;

print $param,$/;

__END__
E:\dev\LOOSE>perl regex.utf8.pl
abe à in range lincoln ÿ out of range
abe à in range lincoln X out of range

E:\dev\LOOSE>perl -Mutf8 regex.utf8.pl
abe à in range lincoln ÿ out of range
abe X in range lincoln X out of range

E:\dev\LOOSE>G:\perl\bin\perl regex.utf8.pl
abe à in range lincoln ÿ out of range
abe à in range lincoln X out of range

E:\dev\LOOSE>G:\perl\bin\perl -Mutf8 regex.utf8.pl
abe à in range lincoln ÿ out of range
abe à in range lincoln X out of range

E:\dev\LOOSE>
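
To check the whole upper half rather than two sample values, a small loop can report which byte values the class lets through. This is just a sketch for testing on your own perl; the variable names (@kept, $copy) are made up here, and it assumes plain byte strings with no utf8 pragma in effect:

```perl
use strict;
use warnings;

# Collect every value in 128..255 that survives the substitution.
my @kept;
for my $n ( 128 .. 255 ) {
    my $copy = chr $n;
    $copy =~ s/[^a-zA-Z0-9\.\-\=\+\!\@\#\$\%\^\&\*\?\ \x80-\x97\xa0-\xa5]//g;
    push @kept, $n if length $copy;
}

print "kept: @kept\n";
```

If the list is anything other than 128-151 and 160-165, the data is probably not being treated as single bytes somewhere along the way.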

