in reply to Re^5: What's the 'M-' characters and how to filter/correct them?
in thread What's the 'M-' characters and how to filter/correct them?

Yes, you get the point, and thanks for pulling my thoughts out of the swirl.

After thinking a bit more, I think what I want is the following:
If they are just non-English language characters, I would rather keep them.
For the other cases, like the Excel empty character, I think they should be removed as long as the visible content remains unchanged.

So I think my script needs to know exactly what the extended characters in the data file really are...
Checking the long 8859-1 list, it looks like I either have to list every non-English language character (in decimal/hex form) in my regex, or I have to list all the garbage-like characters...

As this may make the code hard to maintain, is there a way to group all the useful/non-useful characters into one category in the regex?
The ideal code I'm dreaming about probably looks like the following:

# if any non-english language chara in the line
if ( $line =~ /[[:non-english_chara_class:]]/g ) {
    print "Keep the content as is";
}
elsif ( $line =~ /[[:garbage_chara_class:]]/g ) {
    # do some filtering
}

Would that be possible?

Re^7: What's the 'M-' characters and how to filter/correct them?
by shmem (Chancellor) on Jan 21, 2016 at 15:28 UTC
    The ideal code I'm dreaming about probably looks like following:
    # if any non-english language chara in the line
    if ( $line =~ /[[:non-english_chara_class:]]/g ) {
        print "Keep the content as is";
    }
    elsif ( $line =~ /[[:garbage_chara_class:]]/g ) {
        # do some filtering
    }

    Your if/else block wouldn't work as expected, since the if-condition would always be true as long as there is one character pertaining to the [[:non-english_chara_class:]] among any number of [[:garbage_chara_class:]] characters. Perhaps match the garbage first and remove it.
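
A minimal sketch of the "remove the garbage first" order suggested above. It assumes the line has already been decoded to characters (for example by reading the file through an `:encoding(ISO-8859-1)` layer), and the choice of what counts as garbage is only an assumption: control characters other than tab and newline, Unicode format characters such as the soft hyphen, and the no-break space, which is turned into a plain space. The names `clean_line` and `has_non_ascii_letter` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: strip "garbage" from an already-decoded line.
# Which characters count as garbage is an assumption of this sketch.
sub clean_line {
    my ($line) = @_;
    $line =~ s/\x{A0}/ /g;            # no-break space becomes a plain space
    $line =~ s/(?![\t\n])\p{Cc}//g;   # control characters, keeping tab and newline
                                      # (this also drops carriage returns)
    $line =~ s/\p{Cf}//g;             # format characters, e.g. the soft hyphen
    return $line;
}

# Hypothetical helper: true if the line contains a letter outside ASCII.
sub has_non_ascii_letter {
    my ($line) = @_;
    return $line =~ /(?![[:ascii:]])\p{L}/ ? 1 : 0;
}
```

Cleaning first and testing afterwards avoids the either/or problem: a line containing both accented letters and garbage is cleaned and still keeps its letters.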

    Apart from that, it is entirely possible to define your own "named" character classes using the (still experimental) (?[ ]) regular expression construct, available from perl v5.18 onwards (see Extended Bracketed Character Classes in perlrecharclass):

    no warnings "experimental::regex_sets";

    my @expressions = (
        "Depósito Centralizado",
        "voilà: a word with an accent grave",
        'this is plain ascii',
        'gräßliches Tröten',
    );
    my $acute = join '', map { chr $_ } (193,201,205,211,218,221,225,233,237,243,250,253);
    my $grave = join '', map { chr $_ } (192,200,204,210,217,224,232,236,242,249);
    my $spanish = "[:ascii:] + $acute";
    my $french  = "$spanish + $grave";

    for (@expressions) {
        if (/^[[:ascii:]]+$/) {
            print 'ascii';
        }
        elsif (/^(?[[$spanish]])+$/) {
            print 'spanish';
        }
        elsif (/^(?[[$french]])+$/) {
            print 'french';
        }
        else {
            print 'unknown';
        }
        print ": $_\n";
    }
    __END__
    spanish: Depósito Centralizado
    french: voilà: a word with an accent grave
    ascii: this is plain ascii
    unknown: gräßliches Tröten

    The above character classes are far from complete, but you get the meaning ;-)
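
Another way to get named classes, without the experimental construct, might be Perl's user-defined character properties (described in perlunicode): a sub whose name starts with `In` or `Is` returns a list of hex code points, one per line, and can then be used as `\p{...}`. The sketch below reuses the code points from the post above; the names `InAcute`, `InGrave`, `InSpanishText`, `InFrenchText` and `classify` are made up for illustration, and the lists are just as incomplete.

```perl
use strict;
use warnings;

# Each property sub returns hex code points, one per line
# (or "START<tab>END" for a range). The lists live inside the subs
# so they are available whenever perl decides to call them.
sub InAcute {
    return join '', map { sprintf "%04X\n", $_ }
        193, 201, 205, 211, 218, 221, 225, 233, 237, 243, 250, 253;
}

sub InGrave {
    return join '', map { sprintf "%04X\n", $_ }
        192, 200, 204, 210, 217, 224, 232, 236, 242, 249;
}

# ASCII range plus the acute-accented letters.
sub InSpanishText { return "0000\t007F\n" . InAcute() }

# Everything in InSpanishText plus the grave-accented letters.
sub InFrenchText { return InSpanishText() . InGrave() }

# Hypothetical classifier mirroring the if/elsif chain in the post.
sub classify {
    my ($text) = @_;
    return 'ascii'   if $text =~ /\A[[:ascii:]]+\z/;
    return 'spanish' if $text =~ /\A\p{InSpanishText}+\z/;
    return 'french'  if $text =~ /\A\p{InFrenchText}+\z/;
    return 'unknown';
}
```

The property subs are defined before the regexes that use them, which keeps the sketch independent of when a particular perl version looks the property up.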

    <update> correct ascii test and move to first test </update>

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re^7: What's the 'M-' characters and how to filter/correct them?
by Anonymous Monk on Jan 20, 2016 at 22:01 UTC

    ... no code tags ..

    start using code tags please  <c> code here </c>

      Oops... sorry I missed the tags for the code...

      I have updated the thread now, hope it looks easier to read.

      Thanks

Re^7: What's the 'M-' characters and how to filter/correct them?
by AnomalousMonk (Archbishop) on Jan 21, 2016 at 20:05 UTC

    Further to shmem's post above:
    If it's just a question of translating text from single seventh-bit-set characters to some kind of pure-ASCII representation, it's possible to do this in one swell foop. The process of defining the translations can be a bit tedious, but it's done just once (update: and could even be done in a module for general inclusion). Note that in the code below, the German eszett/sharp-s "ß" translates to the "ss" ASCII letter pair. See also the possibly useful discussion of something like this for multi-byte Unicode sequences in the recent threads Read RegEx from file and particularly Re: Read RegEx from file.

    use warnings;
    use strict;

    use Data::Dump qw(dd);

    # code point => ASCII replacement
    my @acutes = qw(193 A  201 E  205 I  211 O  218 U  225 a  233 e  237 i  243 o  250 u  221 Y  253 y);
    my @graves = qw(192 A  200 E  204 I  210 O  217 U  224 a  232 e  236 i  242 o  249 u);
    my @others = qw(228 a  246 o  223 ss);

    my %xlate = (@acutes, @graves, @others);
    # dd \%xlate;  # FOR DEBUG

    # build one character class from the octal escapes of all keys
    my ($search) =
        map qr{[$_]}xms,
        join '',
        map sprintf('\%03o', $_),
        keys %xlate
        ;
    # dd $search;  # FOR DEBUG

    while (my $line = <DATA>) {
        chomp $line;
        $line =~ s{ ($search) }{$xlate{ord $1}}xmsg;
        # negated POSIX class: any character outside ASCII
        die "non-ascii in '$line'" if $line =~ m{ [[:^ascii:]] }xms;
        print "'$line' \n";
    }
    __DATA__
    spanish: Depósito Centralízado
    french: voilà: a word with an accent gravè
    vanilla: this is plain ascii
    other: gräßliches Tröten
    Output:
    c:\@Work\Perl\monks\sylph001>perl xlate_to_ascii_1.pl
    'spanish: Deposito Centralizado'
    'french: voila: a word with an accent grave'
    'vanilla: this is plain ascii'
    'other: grassliches Troten'
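
If a hand-written table becomes tedious, one possible alternative is to decompose the characters with the core module Unicode::Normalize and then drop the combining marks. This is only a sketch and assumes the text is already decoded to characters; `to_ascii` is a hypothetical name, and the sharp s is handled separately because it has no decomposition.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: strip accents by canonical decomposition.
sub to_ascii {
    my ($text) = @_;
    $text =~ s/\x{DF}/ss/g;   # sharp s does not decompose, so map it by hand
    $text = NFD($text);       # e.g. a-umlaut becomes "a" + combining diaeresis
    $text =~ s/\p{Mn}//g;     # remove the combining (nonspacing) marks
    return $text;
}
```

Characters that neither decompose nor appear in a manual rule pass through unchanged, so a final check for non-ASCII characters, as in the script above, is still worth keeping.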


    Give a man a fish:  <%-{-{-{-<