The ideal code I'm dreaming about probably looks like following:

    # if any non-english language chara in the line
    if ( $line =~ /[[:non-english_chara_class:]]/g ) {
        print "Keep the content as is";
    }
    elsif ( $line =~ /[[:garbage_chara_class:]]/g ) {
        # do some filtering
    }
Your if/else block wouldn't work as expected: the if-condition is true as soon as the line contains a single character from [[:non-english_chara_class:]], no matter how many [[:garbage_chara_class:]] characters surround it, so the elsif branch is never reached for such lines. It is probably better to match the garbage first and remove it.
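A minimal sketch of that "garbage first" order, for illustration only. The definition of garbage used here (control characters other than tab, newline and carriage return) and the helper name clean_line are assumptions, not something taken from the thread; substitute whatever your data actually calls for.

```perl
use strict;
use warnings;

# ASSUMPTION: "garbage" means control characters (Unicode category Cc)
# other than tab, newline and carriage return.
my $garbage = qr/[^\P{Cc}\t\n\r]/;

# Hypothetical helper: strip the garbage first, then classify what remains.
sub clean_line {
    my ($line) = @_;
    (my $cleaned = $line) =~ s/$garbage//g;    # step 1: remove garbage
    # step 2: only now look for non-ASCII characters worth keeping
    my $kind = $cleaned =~ /[^[:ascii:]]/ ? 'non-ascii' : 'ascii';
    return ($cleaned, $kind);
}
```

Because the garbage is removed before anything else is tested, a line containing both kinds of characters is filtered and still keeps its legitimate non-ASCII content.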
Apart from that, it is entirely possible to define your own "named" character classes using the (still experimental) (?[ ]) regular expression construct, available from perl v5.18 onwards (see Extended Bracketed Character Classes in perlrecharclass):
    use utf8;    # assumes this file is saved as UTF-8
    binmode STDOUT, ':encoding(UTF-8)';
    no warnings "experimental::regex_sets";

    my @expressions = (
        "Depósito Centralizado",
        "voilà: a word with an accent grave",
        'this is plain ascii',
        'gräßliches Tröten',
    );

    # code points of Latin-1 letters with acute and grave accents
    my $acute = join '', map { chr $_ } (193,201,205,211,218,221,225,233,237,243,250,253);
    my $grave = join '', map { chr $_ } (192,200,204,210,217,224,232,236,242,249);

    my $spanish = "[:ascii:] + $acute";
    my $french  = "$spanish + $grave";

    for (@expressions) {
        if    (/^[[:ascii:]]+$/)     { print 'ascii'; }
        elsif (/^(?[[$spanish]])+$/) { print 'spanish'; }
        elsif (/^(?[[$french]])+$/)  { print 'french'; }
        else                         { print 'unknown'; }
        print ": $_\n";
    }
    __END__
    spanish: Depósito Centralizado
    french: voilà: a word with an accent grave
    ascii: this is plain ascii
    unknown: gräßliches Tröten
The above character classes are far from complete, but you get the idea ;-)
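For comparison, here is a small sketch that puts the set operators of (?[ ]) directly between Unicode properties instead of listing code points by hand. The names $latin_non_ascii and is_latin_text are made up for this illustration, and treating "ASCII plus Latin script" as a stand-in for "acceptable text" is an assumption, not a complete language test.

```perl
use strict;
use warnings;
no warnings 'experimental::regex_sets';

# Set difference: Latin-script characters that are not plain ASCII
# (accented letters such as e-acute or o-acute).
my $latin_non_ascii = qr/(?[ \p{Latin} - \p{ASCII} ])/;

# Set union: a string made up solely of ASCII or Latin-script characters.
# Hypothetical helper name, used only for this example.
sub is_latin_text {
    my ($text) = @_;
    return $text =~ /^(?[ \p{ASCII} + \p{Latin} ])+$/ ? 1 : 0;
}
```

Inside (?[ ]) the operators + (union), - (difference) and & (intersection) combine whole classes, which makes it easier to build a class out of existing properties than to enumerate characters.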
<update> Corrected the ASCII test and moved it to be the first test. </update>
In reply to Re^7: What's the 'M-' characters and how to filter/correct them?
by shmem
in thread What's the 'M-' characters and how to filter/correct them?
by sylph001