So, you don't want to keep the underscore character ("_")? Or comma, colon, semi-colon, slash, backslash, tilde, single/double quotes, parens, curly or square brackets? (Just checking -- enumerations of characters like you have there can be prone to leaving things out by mistake.)
On the page that you cite ("www.asciitable.com"), the "extended ASCII set" listed there is actually known as "PC Code Page 437" (or "cp437"), which was developed for the original IBM PCs running MS-DOS, was inherited by virtually all IBM clones, and is therefore arguably "the most popular" (as asserted on that page).
Perl 5.8's "Encode" module can "decode" such data into utf8, so that you can deal with it as character data, rather than as byte values; and it can then "encode" it again as cp437 for output, if you want to keep to the old character set. Note that the accented characters in utf8 will be two bytes each, and will be useless when treated by any non-utf8-capable display tool or process. (The perl-internal treatment of utf8 character data in 5.8 allows you to ignore the single-byte vs. multi-byte distinction when writing the script -- every character is just a character (matches "." in a regex, etc), no matter how many bytes are needed to express it in utf8.)
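Here is a minimal sketch of that decode/encode round trip. It assumes the input really is cp437; the sample string and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Made-up sample: bytes 0x85 and 0x98 are a-grave and y-diaeresis in cp437.
my $bytes = "abe \x85 lincoln \x98";

# Turn the cp437 bytes into perl's internal character data.
my $chars = decode('cp437', $bytes);

# Each accented letter now counts as exactly one character.
my $char_count = length $chars;

# Re-encode for output: utf8 needs two bytes per accented letter here,
# while cp437 gives back the original single-byte values.
my $utf8  = encode('UTF-8', $chars);
my $cp437 = encode('cp437', $chars);

printf "%d characters, %d bytes as utf8, %d bytes as cp437\n",
    $char_count, length($utf8), length($cp437);
```

Once the data is decoded, a regex character class can name the accented letters as characters instead of spelling out byte ranges.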
The perl 5.8 man pages perluniintro, perlunicode, Encode, PerlIO and PerlIO::Encoding all have useful information on this and related issues.
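The same conversion can also be done on the file handles with an ":encoding(...)" layer, which is what the PerlIO::encoding page covers. A small sketch, using in-memory handles only to keep it self-contained (in a real script you would name your files instead); the sample data and variable names are assumptions:

```perl
use strict;
use warnings;

# Made-up cp437 sample: 0x85 and 0x98 are accented letters in that code page.
my $bytes = "abe \x85 lincoln \x98";

# Reading through the layer hands the script decoded characters.
open(my $in, '<:encoding(cp437)', \$bytes) or die $!;
my $text = <$in>;
close $in;

# Writing through the layer converts the characters back to cp437 bytes.
my $buf = '';
open(my $out, '>:encoding(cp437)', \$buf) or die $!;
print {$out} $text;
close $out or die $!;

print "round trip ", ($buf eq $bytes ? "preserved" : "changed"), " the bytes\n";
```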
If you would prefer that the data remain in cp437 encoding, and have the perl script treat it as byte values as shown in your script, you will need one or more of the following pragmas in your script (depending, perhaps, on which linux distro/version you have):
no utf8;
use bytes;
You may even have to specify an IO mode when opening the input and/or output files:
open( IN, "<:raw", "input.file" ) or die $!;
open( OUT, ">:raw", "output.file" ) or die $!;
# or, if you're dealing with STDIN and/or STDOUT:
binmode STDIN, ":raw";
binmode STDOUT, ":raw";
This will make sure that perl doesn't try to treat the data as utf8-encoded text.
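A rough sketch of the byte-oriented route, assuming cp437 input. The temp file, the sample data and the trimmed-down character class are stand-ins, not your actual files or pattern:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Made-up cp437 sample: 0x85 falls inside \x80-\x97, 0x98 does not.
my $bytes = "abe \x85 lincoln \x98\n";

# Write the bytes untouched to a scratch file.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh, ':raw';
print {$fh} $bytes;
close $fh or die $!;

# Read them back as raw bytes; no utf8 interpretation is attempted.
open(my $in, '<:raw', $path) or die $!;
my $line = <$in>;
close $in;

# Byte-range class: keep printable ASCII, newline and 0x80-0x97.
(my $filtered = $line) =~ s/[^\x20-\x7e\x80-\x97\n]/X/g;

print length($line), " bytes read\n";
```

With the handles in ":raw" mode, the ranges in the character class line up with the byte values in the file.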
use strict;
use warnings;
#use utf8;
my $param = 'abe '.chr(133).' in range lincoln '.chr(152).' out of range';
#my $param = "abe \x85 in range lincoln \x98 out of range";
print $param,$/;
$param =~ s/[^a-zA-Z0-9\.\-\=\+\!\@\#\$\%\^\&\*\?\ \x80-\x97\xa0-\xa5]/X/g;
print $param,$/;
__END__
E:\dev\LOOSE>perl regex.utf8.pl
abe à in range lincoln ÿ out of range
abe à in range lincoln X out of range
E:\dev\LOOSE>perl -Mutf8 regex.utf8.pl
abe à in range lincoln ÿ out of range
abe X in range lincoln X out of range
E:\dev\LOOSE>G:\perl\bin\perl regex.utf8.pl
abe à in range lincoln ÿ out of range
abe à in range lincoln X out of range
E:\dev\LOOSE>G:\perl\bin\perl -Mutf8 regex.utf8.pl
abe à in range lincoln ÿ out of range
abe à in range lincoln X out of range
E:\dev\LOOSE>
so my guess is still that it's some kind of encoding issue
(if there really is an issue at all).