in reply to Re^3: What's the 'M-' characters and how to filter/correct them?
in thread What's the 'M-' characters and how to filter/correct them?

Thank you for the explanation.

I think I can now get my script to recognize the non-ASCII characters in the data.

However, when I try to remove or replace the non-ASCII characters with a regex, the result still shows some unexpected characters (wrapped in angle brackets) left in place. For example:

    $line =~ s/[^:ascii]//g;
    print $out_hdl "$line";

Result:

11AM<A0> LONDON

Dep<F3>sito Centralizado

This is not what I saw in the various examples on the internet.

So, do you have any idea what is left there, and how it could be fully removed by this kind of regex?

 

Thanks

Re^5: What's the 'M-' characters and how to filter/correct them?
by shmem (Chancellor) on Jan 20, 2016 at 10:17 UTC

    I am beginning to suspect that you have an XY Problem. Why do you have to sanitize your data in the first place? To what end? Is it really useful to just weed out characters and turn Depósito into Depsito?

    Then, your character class definition is incomplete. It should be [^[:ascii:]]: the closing colon was missing, and the POSIX class [:ascii:] must itself sit inside a bracketed character class.
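A minimal sketch of why the nesting matters, assuming the line holds ISO-8859-1 bytes. The sample string is built with a `\xF3` escape so the encoding of the script file itself does not matter, and both sub names are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: the regex exactly as posted.
# [^:ascii] is an ordinary negated class: "any character except : a s c i".
sub strip_as_posted {
    my ($line) = @_;
    $line =~ s/[^:ascii]//g;
    return $line;
}

# Hypothetical helper: the POSIX class nested inside a bracketed class.
sub strip_non_ascii {
    my ($line) = @_;
    $line =~ s/[^[:ascii:]]//g;
    return $line;
}

my $sample = "Dep\xF3sito Centralizado";
print strip_non_ascii($sample), "\n";   # Depsito Centralizado
print strip_as_posted($sample), "\n";   # siaia
```

The equivalent spelling `[[:^ascii:]]` negates the POSIX class itself instead of the surrounding bracketed class.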

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'

      Yes, you get the point, and thanks for pulling my thoughts out of the swirl.

      After thinking about it a bit more, I think what I want is the following:
      If they are just non-English language characters, I would rather keep them.
      In other cases, like the Excel empty character, I think they should be removed as long as the visible content remains unchanged.

      So I think my script needs to know what exactly the extended characters in the data file really are...
      Checking the long 8859-1 list, it looks like I either have to list every non-English language character (in decimal/hex form) in my regex, or list all the garbage-like characters...

      As this may make the code hard to maintain, is there a way to group all the useful/non-useful characters into one category in the regex?
      The ideal code I'm dreaming of probably looks like the following:

      # if any non-english language chara in the line
      if ( $line =~ /[[:non-english_chara_class:]]/g ) {
          print "Keep the content as is";
      }
      elsif ( $line =~ /[[:garbage_chara_class:]]/g ) {
          # do some filtering
      }

      Would that be possible?

        The ideal code I'm dreaming about probably looks like following:
        # if any non-english language chara in the line
        if ( $line =~ /[[:non-english_chara_class:]]/g ) {
            print "Keep the content as is";
        }
        elsif ( $line =~ /[[:garbage_chara_class:]]/g ) {
            # do some filtering
        }

        Your if/else block wouldn't work as expected: the if-condition is true as soon as the line contains a single character from [[:non-english_chara_class:]], no matter how many [[:garbage_chara_class:]] characters it also contains, so the elsif branch is never reached for such lines. Perhaps match the garbage first and remove it.
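A rough sketch of the "garbage first" order, assuming ISO-8859-1 input. What counts as garbage here (the C1 control codes 0x80-0x9F and the soft hyphen 0xAD, with the no-break space 0xA0 turned into a plain space) is only an assumption for illustration, as is the sub name `clean_line`:

```perl
use strict;
use warnings;

# Assumed definition of "garbage" for ISO-8859-1 input:
# C1 control codes (0x80-0x9F) and the soft hyphen (0xAD).
my $garbage = qr/[\x80-\x9F\xAD]/;

# Hypothetical helper: clean one line first, then report what is left.
sub clean_line {
    my ($line) = @_;
    $line =~ s/\xA0/ /g;        # no-break space becomes an ordinary space
    $line =~ s/$garbage//g;     # drop the rest of the assumed garbage
    my $kind = $line =~ /[^[:ascii:]]/ ? 'non-english' : 'ascii';
    return ($kind, $line);
}

for my $raw ("11AM\xA0 LONDON", "Dep\xF3sito Centralizado") {
    my ($kind, $clean) = clean_line($raw);
    print "$kind: $clean\n";
}
```

Because the cleanup runs unconditionally before the test, a line that mixes accented letters and garbage keeps the former and loses the latter. To delete the no-break space outright rather than replace it, add `\xA0` to `$garbage` and drop the first substitution.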

        Apart from that, it is entirely possible to define your own "named" character classes using the (still experimental) (?[ ]) regular expression construct, available from perl v5.18 onwards (see Extended Bracketed Character Classes in perlrecharclass):

        no warnings "experimental::regex_sets";

        my @expressions = (
            "Depósito Centralizado",
            "voilà: a word with an accent grave",
            'this is plain ascii',
            'gräßliches Tröten',
        );

        my $acute = join'', map { chr $_ } (193,201,205,211,218,221,225,233,237,243,250,253);
        my $grave = join'', map { chr $_ } (192,200,204,210,217,224,232,236,242,249);

        my $spanish = "[:ascii:] + $acute";
        my $french  = "$spanish + $grave";

        for (@expressions) {
            if (/^[[:ascii:]]+$/) {
                print 'ascii';
            }
            elsif (/^(?[[$spanish]])+$/) {
                print 'spanish';
            }
            elsif (/^(?[[$french]])+$/) {
                print 'french';
            }
            else {
                print 'unknown';
            }
            print ": $_\n";
        }
        __END__
        spanish: Depósito Centralizado
        french: voilà: a word with an accent grave
        ascii: this is plain ascii
        unknown: gräßliches Tröten

        The above character classes are far from complete, but you get the meaning ;-)
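If listing code points by hand becomes tedious, one alternative (a different technique from the hand-built classes above) is to decode the bytes and lean on Perl's built-in Unicode properties. The sketch below assumes ISO-8859-1 input, and `has_non_ascii_letter` is a hypothetical helper name:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the text contains a letter outside ASCII.
sub has_non_ascii_letter {
    my ($bytes) = @_;
    my $text = decode('ISO-8859-1', $bytes);   # assumes Latin-1 input
    # \p{L} matches any Unicode letter; the lookahead excludes the ASCII ones.
    return $text =~ /(?![[:ascii:]])\p{L}/ ? 1 : 0;
}

print has_non_ascii_letter("Dep\xF3sito"), "\n";        # 1
print has_non_ascii_letter("11AM\xA0 LONDON"), "\n";    # 0
```

The no-break space is a separator rather than a letter, so it does not count as a "non-English language character" under this test.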

        <update> correct ascii test and move to first test </update>


        ... no code tags ..

        start using code tags please  <c> code here </c>

        Further to shmem's post above:
        If it's just a question of translating text from single seventh-bit-set characters to some kind of pure-ASCII representation, it's possible to do this in one swell foop. The process of defining the translations can be a bit tedious, but it's done just once (update: and could even be done in a module for general inclusion). Note that in the code below, the German eszett/sharp-s "ß" translates to the "ss" ASCII letter pair. See also the possibly useful discussion of something like this for multi-byte Unicode sequences in the recent threads Read RegEx from file and particularly Re: Read RegEx from file.

        use warnings;
        use strict;

        use Data::Dump qw(dd);

        # code point => ASCII replacement
        my @acutes = qw(193 A  201 E  205 I  211 O  218 U  225 a  233 e  237 i  243 o  250 u  221 Y  253 y);
        my @graves = qw(192 A  200 E  204 I  210 O  217 U  224 a  232 e  236 i  242 o  249 u);
        my @others = qw(228 a  246 o  223 ss);

        my %xlate = (@acutes, @graves, @others);
        # dd \%xlate;  # FOR DEBUG

        # one character class holding every translatable code point, as octal escapes
        my ($search) =
            map qr{[$_]}xms,
            join '',
            map sprintf('\%03o', $_),
            keys %xlate
            ;
        # dd $search;  # FOR DEBUG

        while (my $line = <DATA>) {
            chomp $line;
            $line =~ s{ ($search) }{$xlate{ord $1}}xmsg;
            # the POSIX class must be nested: [^[:ascii:]], not [[^:ascii:]]
            die "non-ascii in '$line'" if $line =~ m{ [^[:ascii:]] }xms;
            print "'$line' \n";
        }

        __DATA__
        spanish: Depósito Centralízado
        french: voilà: a word with an accent gravè
        vanilla: this is plain ascii
        other: gräßliches Tröten
        Output:
        c:\@Work\Perl\monks\sylph001>perl xlate_to_ascii_1.pl
        'spanish: Deposito Centralizado'
        'french: voila: a word with an accent grave'
        'vanilla: this is plain ascii'
        'other: grassliches Troten'
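As a possible shortcut for the tedious table-building step, and not the method used in the post above, the core module Unicode::Normalize can split most accented Latin letters into a base letter plus a combining mark, which can then be dropped. The sketch assumes ISO-8859-1 input; `to_ascii` is a hypothetical name, and letters without a canonical decomposition (such as the sharp s) still need explicit handling:

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFD);

# Hypothetical helper: reduce accented Latin-1 text to plain ASCII letters.
sub to_ascii {
    my ($bytes) = @_;
    my $text = NFD(decode('ISO-8859-1', $bytes));  # split letters from accents
    $text =~ s/\p{Mn}//g;                          # drop the combining marks
    $text =~ s/\xDF/ss/g;                          # sharp s has no decomposition
    return $text;
}

print to_ascii("gr\xE4\xDFliches Tr\xF6ten"), "\n";   # grassliches Troten
```

Characters that neither decompose nor appear in an explicit substitution pass through unchanged, so a final check against `[^[:ascii:]]` is still worthwhile.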


        Give a man a fish:  <%-{-{-{-<