in reply to extended ASCII regex range
On the page that you cite ("www.asciitable.com"), the "extended ASCII set" listed there is actually known as "PC Code Page 437" (or "cp437"), which was developed for the original IBM PCs running MS-DOS, was inherited by virtually all IBM clones, and is therefore arguably "the most popular" (as asserted on that page).
Perl 5.8's "Encode" module can "decode" such data into utf8, so that you can deal with it as character data, rather than as byte values; and it can then "encode" it again as cp437 for output, if you want to keep to the old character set. Note that the accented characters in utf8 will be two bytes each, and will be useless when treated by any non-utf8-capable display tool or process. (The perl-internal treatment of utf8 character data in 5.8 allows you to ignore the single-byte vs. multi-byte distinction when writing the script -- every character is just a character (matches "." in a regex, etc), no matter how many bytes are needed to express it in utf8.)
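As a rough illustration of that decode/encode round trip, here is a minimal sketch. The sample bytes and variable names are made up for the example; the only calls used are `decode` and `encode` from the Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample data: three cp437 bytes.
# 0x82 is e-acute, 0x8A is e-grave, 0x41 is a plain "A".
my $cp437_bytes = "\x82\x8A\x41";

# Turn the cp437 bytes into perl-internal character data.
my $chars = decode('cp437', $cp437_bytes);

# Each character counts as one unit, however many utf8 bytes it needs.
printf "characters: %d\n", length $chars;        # 3

# The same text as utf8 bytes: two bytes per accented letter, one for "A".
my $utf8_bytes = encode('UTF-8', $chars);
printf "utf8 bytes: %d\n", length $utf8_bytes;   # 5

# Convert back to cp437 for tools that only know the old character set.
my $round_trip = encode('cp437', $chars);
print "round trip ok\n" if $round_trip eq $cp437_bytes;
```

Once the data has been decoded, a regex such as /\w+/ or "." works on whole characters, so the script itself never has to count bytes.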
The perl 5.8 man pages perluniintro, perlunicode, Encode, PerlIO and PerlIO::encoding all have useful information on this and related issues.
If you would prefer that the data remain in cp437 encoding, and have the perl script treat it as byte values as shown in your script, you will need one or more of the following pragmas in your script (depending, perhaps, on which linux distro/version you have):

    no utf8;
    use bytes;

This will make sure that perl doesn't try to treat the data as utf8-encoded text. You may even have to specify an IO mode when opening the input and/or output files:

    open( IN, "<:raw", "input.file" ) or die $!;
    open( OUT, ">:raw", "output.file" ) or die $!;

    # or, if you're dealing with STDIN and/or STDOUT:
    binmode STDIN, ":raw";
    binmode STDOUT, ":raw";
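For the byte-oriented approach, a regex range over the high byte values might look something like the sketch below. The sample string and variable names are invented for illustration, and the string is a plain byte literal, so the sketch does not need any pragma or IO layer to run.

```perl
use strict;
use warnings;

# Hypothetical sample: "caf" followed by byte 0x82 (e-acute in cp437).
my $line = "caf\x82";

# Count the bytes that fall outside the 7-bit ASCII range.
my $high = () = $line =~ /[\x80-\xFF]/g;
print "high bytes: $high\n";

# Replace each high byte with a visible hex escape for inspection.
(my $escaped = $line) =~ s/([\x80-\xFF])/sprintf('\\x%02X', ord $1)/ge;
print "$escaped\n";
```

With data read from a file, this only behaves as byte matching if the handle really delivers raw bytes, which is what the ":raw" mode is for.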