comment on

So, you don't want to keep the underscore character ("_")? Or comma, colon, semi-colon, slash, backslash, tilde, single/double quotes, parens, curly or square brackets? (Just checking -- enumerations of characters like you have there can be prone to leaving things out by mistake.)

On the page that you cite ("www.asciitable.com"), the "extended ASCII set" listed there is actually known as "PC Code Page 437" (or "cp437"), which was developed for the original IBM PCs running MS-DOS, was inherited by virtually all IBM clones, and is therefore arguably "the most popular" (as asserted on that page).

Perl 5.8's "Encode" module can "decode" such data into utf8, so that you can deal with it as character data, rather than as byte values; and it can then "encode" it again as cp437 for output, if you want to keep to the old character set. Note that the accented characters in utf8 will be two bytes each, and will be useless when treated by any non-utf8-capable display tool or process. (The perl-internal treatment of utf8 character data in 5.8 allows you to ignore the single-byte vs. multi-byte distinction when writing the script -- every character is just a character (matches "." in a regex, etc), no matter how many bytes are needed to express it in utf8.)

The perl 5.8 man pages perluniintro, perlunicode, Encode, PerlIO and PerlIO::Encoding all have useful information on this and related issues.

If you would prefer that the data remain in cp437 encoding, and have the perl script treat is as byte values as shown in your script, you will need one or more of the following pragmas in your script (depending, perhaps, on which linux distro/version you have):

no utf8;
use bytes;
[download]

You may even have to specify an IO mode when opening the input and/or output files:

open( IN, "<:raw", "input.file" ) or die $!;
open( OUT, ">:raw", "output.file" ) or die $!;

# or, if you're dealing with STDIN and/or STDOUT:

binmode STDIN, ":raw";
binmode STDOUT, ":raw";
[download]

This will make sure that perl doesn't try to treat the data as utf8-encoded text.

In reply to Re: extended ASCII regex range by graff
in thread extended ASCII regex range by theirpuppet

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.