The ideal code I'm dreaming about probably looks like the following:
# if any non-english language chara in the line
if ( $line =~ /[[:non-english_chara_class:]]/g ) {
print "Keep the content as is";
}
elsif ( $line =~ /[[:garbage_chara_class:]]/g ) {
# do some filtering
}
Your if/elsif block wouldn't work as expected: the if condition is true as soon as the line contains a single character from [[:non-english_chara_class:]], no matter how many [[:garbage_chara_class:]] characters surround it, so the elsif branch is never reached for such a line. Perhaps match the garbage first and remove it.
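A minimal sketch of that "remove the garbage first, then test" order. What counts as garbage is purely an assumption here (ASCII control characters other than tab and newline), and the helper names strip_garbage and has_non_ascii are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical definition of "garbage": ASCII control characters
# other than tab and newline. Adjust the ranges to your data.
sub strip_garbage {
    my ($line) = @_;
    (my $clean = $line) =~ tr/\x00-\x08\x0B-\x1F\x7F//d;
    return $clean;
}

# Returns 1 if the line contains at least one non-ASCII character.
sub has_non_ascii {
    my ($line) = @_;
    return $line =~ /[^[:ascii:]]/ ? 1 : 0;
}

for my $line ("Dep\x{F3}sito\x{01}\x{07}", "plain\x{00} ascii") {
    my $clean = strip_garbage($line);    # filter first ...
    printf "%s (%d chars left)\n",       # ... then classify what remains
        has_non_ascii($clean) ? 'keep as is' : 'ascii only',
        length $clean;
}
```

Because the filtering happens unconditionally, a line that mixes accented characters and garbage gets cleaned as well, which the original if/elsif ordering could not do.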
Apart from that, it is entirely possible to define your own "named" character classes using the (still experimental) (?[ ]) regular expression construct, available from perl v5.18 onwards (see Extended Bracketed Character Classes in perlrecharclass):
use strict;
use warnings;
use utf8;    # the string literals below contain non-ASCII characters
no warnings "experimental::regex_sets";
binmode STDOUT, ':encoding(UTF-8)';

my @expressions = (
    "Depósito Centralizado",
    "voilà: a word with an accent grave",
    'this is plain ascii',
    'gräßliches Tröten',
);
# Latin-1 code points of the letters carrying an acute / grave accent
my $acute = join '', map { chr $_ } (193,201,205,211,218,221,225,233,237,243,250,253);
my $grave = join '', map { chr $_ } (192,200,204,210,217,224,232,236,242,249);
# "named" classes, built as set unions inside (?[ ])
my $spanish = "[:ascii:] + [$acute]";
my $french  = "$spanish + [$grave]";
for (@expressions) {
    if (/^[[:ascii:]]+$/) {
        print 'ascii';
    } elsif (/^(?[ $spanish ])+$/) {
        print 'spanish';
    } elsif (/^(?[ $french ])+$/) {
        print 'french';
    } else {
        print 'unknown';
    }
    print ": $_\n";
}
__END__
spanish: Depósito Centralizado
french: voilà: a word with an accent grave
ascii: this is plain ascii
unknown: gräßliches Tröten
The above character classes are far from complete, but you get the idea ;-)
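As a rough sketch of how the same construct could stand in for the [[:non-english_chara_class:]] from the question: the set difference below means "any letter that is not ASCII". Treating that as "non-English" is an assumption, and the name has_non_ascii_letter is invented for this example:

```perl
use strict;
use warnings;
use utf8;    # the string literals below contain non-ASCII characters
no warnings "experimental::regex_sets";

# Set difference: every Unicode letter minus the ASCII range.
my $non_ascii_letter = qr/(?[ \p{L} - [:ascii:] ])/;

# Returns 1 if the text contains at least one non-ASCII letter.
sub has_non_ascii_letter {
    my ($text) = @_;
    return $text =~ $non_ascii_letter ? 1 : 0;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $text ("Depósito Centralizado", "this is plain ascii") {
    printf "%s: %s\n",
        has_non_ascii_letter($text) ? 'non-ascii letter' : 'ascii only',
        $text;
}
```

Unlike the hand-built $acute and $grave lists, this relies on Unicode properties, so it does not need to enumerate code points, but it also cannot tell one language from another.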
Update: corrected the ascii test and moved it to be the first test.
perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'