
The ideal code I'm dreaming about probably looks like the following:
# if any non-english language chara in the line
if ( $line =~ /[[:non-english_chara_class:]]/g ) {
    print "Keep the content as is";
}
elsif ( $line =~ /[[:garbage_chara_class:]]/g ) {
    # do some filtering
}

Your if/else block wouldn't work as expected: the if-condition is true as soon as the line contains a single character from [[:non-english_chara_class:]], no matter how many [[:garbage_chara_class:]] characters surround it, so the elsif branch never gets to filter such lines. Perhaps match the garbage first and remove it.
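To illustrate the "remove the garbage first" order, here is a minimal sketch. The definition of "garbage" (control characters other than tab and newline) is only an assumption for the sake of the example, and the helper names clean_line and classify are made up; substitute whatever your data actually calls for.

```perl
use strict;
use warnings;

# ASSUMPTION: "garbage" means control characters other than tab (\x09)
# and newline (\x0A). Adjust this class to match your own data.
my $garbage = qr/[\x00-\x08\x0B-\x1F\x7F]/;

# Hypothetical helper: return a copy of the line with the garbage removed.
sub clean_line {
    my ($line) = @_;
    (my $clean = $line) =~ s/$garbage//g;
    return $clean;
}

# Hypothetical helper: strip the garbage first, then look at what is left.
sub classify {
    my ($line) = @_;
    my $clean = clean_line($line);
    return $clean =~ /[^[:ascii:]]/ ? 'non-ascii' : 'ascii';
}
```

Because the filtering happens unconditionally before the test, a line that mixes wanted non-ASCII characters with garbage still gets cleaned.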

Apart from that, it is entirely possible to define your own "named" character classes using the (still experimental) (?[ ]) regular expression construct, available from perl v5.18 onwards (see Extended Bracketed Character Classes in perlrecharclass):

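use strict;
use warnings;
use utf8;                                # assumes this source is saved as UTF-8 (non-ASCII literals below)
binmode STDOUT, ':encoding(UTF-8)';      # print the accented characters correctly
no warnings "experimental::regex_sets";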
my @expressions = (
    "Depósito Centralizado",
    "voilà: a word with an accent grave",
    'this is plain ascii',
    'gräßliches Tröten',
);

my $acute = join '', map { chr $_ } (193,201,205,211,218,221,225,233,237,243,250,253);
my $grave = join '', map { chr $_ } (192,200,204,210,217,224,232,236,242,249);

my $spanish = "[:ascii:] + $acute";
my $french = "$spanish + $grave";

for (@expressions) {
    if (/^[[:ascii:]]+$/) {
        print 'ascii';
    } elsif (/^(?[[$spanish]])+$/) {
        print 'spanish';
    } elsif (/^(?[[$french]])+$/) {
        print 'french';
    } else {
        print 'unknown';
    }
    print ": $_\n";
}
__END__
spanish: Depósito Centralizado
french: voilà: a word with an accent grave
ascii: this is plain ascii
unknown: gräßliches Tröten

The above character classes are far from complete, but you get the idea ;-)
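If enumerating code points by hand gets tedious, the same construct also accepts Unicode properties and real set operators. The following is only a sketch of that idea: the helper name describe is made up, and a script property such as \p{Latin} cannot tell Spanish from French, it only says "Latin-script letter". Non-ASCII punctuation (e.g. «) is not in \p{Latin} and would end up as 'other' here.

```perl
use strict;
use warnings;
no warnings 'experimental::regex_sets';   # only needed on perls before v5.36

# Set difference: Latin-script characters that are not ASCII (accented letters etc.)
my $accented = qr/(?[ \p{Latin} - \p{ASCII} ])/;

# Set union: a line consisting only of ASCII or Latin-script characters
my $latin_line = qr/^(?[ \p{ASCII} + \p{Latin} ])+$/;

# Hypothetical helper: coarse classification of a line.
sub describe {
    my ($line) = @_;
    return 'ascii'           if $line =~ /^\p{ASCII}+$/;
    return 'latin + accents' if $line =~ $latin_line;
    return 'other';
}
```

As in the example above, the plain ASCII test comes first, since every ASCII line would also match the wider union.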

<update> corrected the ascii test and moved it to be the first test </update>

perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'

In reply to Re^7: What's the 'M-' characters and how to filter/correct them? by shmem
in thread What's the 'M-' characters and how to filter/correct them? by sylph001
