
There are no "M-" characters as such. That is just how cat, in its show-nonprinting mode (-v, which -A also implies), displays bytes outside the ASCII range. The ones that you call "trivial" are probably some kind of whitespace character that you don't notice in the spreadsheet.

Once you know what characters they are, for example as suggested in Re: What's the 'M-' characters and how to filter/correct them?, you can remove them with a regular expression. For example, here's a situation I dealt with recently involving invisible special characters that were causing problems with web browsers:

    # U+2028 ('Line Separator') and U+2029 ('Paragraph Separator')
    # are valid JSON but cause a parse error in the browser. So we remove them.
    $job_xml =~ s/\x{2028}|\x{2029}//sg;
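To find out which characters you are actually dealing with, something like the following minimal sketch may help. The helper name describe_non_ascii is made up for illustration, and the sample string only mimics the two lines quoted in this thread. It assumes the data has already been decoded into characters (for a real file that likely means opening it with the right :encoding(...) layer, which is not shown here).

```perl
use strict;
use warnings;
use charnames ();

# Hypothetical helper: returns one "U+XXXX NAME" string per distinct
# non-ASCII character, in order of first appearance.
sub describe_non_ascii {
    my ($text) = @_;
    my (%seen, @report);
    for my $char ($text =~ /([^[:ascii:]])/g) {
        next if $seen{$char}++;
        my $code = ord $char;
        my $name = charnames::viacode($code) // 'unknown';
        push @report, sprintf 'U+%04X %s', $code, $name;
    }
    return @report;
}

# Sample data (an assumption), written with escapes so this source
# file itself stays pure ASCII.
my $sample = "11AM\x{A0} LONDON\nDep\x{F3}sito Centralizado";
print "$_\n" for describe_non_ascii($sample);
```

Once the report names the characters, it is much easier to decide which ones to strip and which ones to keep.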


update: showed recent example
The way forward always starts with a minimal test.

Re^4: What's the 'M-' characters and how to filter/correct them?
by sylph001 (Sexton) on Jan 20, 2016 at 09:54 UTC

    Thank you for the explanation.

    I think I'm now able to get my script to recognize the non-ASCII characters in the data.

    However, when I try to remove or replace the non-ASCII characters using a regex, the result still shows some unexpected characters (wrapped in angle brackets) left in their place. For example:

        $line =~ s/[^:ascii]//g;
        print $out_hdl "$line";

    Result:

    11AM<A0> LONDON

    Dep<F3>sito Centralizado

    This is not what I saw in the various examples on the internet.

    So, do you have any idea what is left there, and how it could be fully removed by this kind of regex?

     

    Thanks

      I am beginning to suspect that you have an XY Problem. Why do you have to sanitize your data in the first place? To what end? Is it really useful to just weed out characters and turn Depósito into Depsito?

      Then, your character class definition is incomplete. It should be [^[:ascii:]]: the closing colon was missing, and a POSIX class like [:ascii:] only works inside an outer pair of brackets.
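A minimal sketch of the corrected substitution, using a made-up sample line rather than the real data file. It also shows why blind stripping is questionable: the accented letter simply disappears.

```perl
use strict;
use warnings;

# Sample line (an assumption), with the accented letter written as an escape.
my $line = "Dep\x{F3}sito Centralizado";

# The POSIX class [:ascii:] sits inside an outer, negated bracketed class.
(my $stripped = $line) =~ s/[^[:ascii:]]//g;

# An equivalent spelling with an explicit code point range.
(my $stripped_range = $line) =~ s/[^\x00-\x7F]//g;

print "$stripped\n";    # Depsito Centralizado
```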

      perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'

        Yes, you get the point, and thanks for pulling my thinking out of the swirl.

        After thinking a bit more, I think what I want is the following:
        If they are just non-English language characters, I would rather keep them.
        In other cases, like the Excel empty character, I think they should be removed as long as the visible content remains unchanged.

        So I think my script needs to know exactly what the extended characters in the data file are...
        Checking the long ISO 8859-1 list, it looks like I have to list either every non-English language character (in decimal/hex form) in my regex, or all the garbage-like characters...

        As this may make the code hard to maintain, is there a way to group all the useful or non-useful characters into one category in the regex?
        The ideal code I'm dreaming about probably looks like the following:

        # if any non-english language chara in the line
        if ( $line =~ /[[:non-english_chara_class:]]/g ) {
            print "Keep the content as is";
        }
        elsif ( $line =~ /[[:garbage_chara_class:]]/g ) {
            # do some filtering
        }

        Would that be possible?
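One possible direction is Perl's Unicode general-category properties, which group characters much like the imagined classes above. The following is only a sketch under assumptions: the helper name clean_line is invented, the choice of what counts as "garbage" (non-ASCII space separators, format characters, line/paragraph separators, stray control characters) is a guess at the intent, and collapsing runs of spaces may not suit every data set. It also assumes the input has already been decoded into characters.

```perl
use strict;
use warnings;

# Hypothetical helper: keeps letters of any language, cleans up
# invisible characters. The categories chosen are an assumption.
sub clean_line {
    my ($line) = @_;

    # Non-ASCII space separators (e.g. U+00A0 NO-BREAK SPACE) become plain spaces.
    $line =~ s/(?! )\p{Zs}/ /g;

    # Invisible format characters and Unicode line/paragraph separators are dropped.
    $line =~ s/[\p{Cf}\p{Zl}\p{Zp}]//g;

    # Control characters other than tab, newline and carriage return are dropped.
    $line =~ s/(?![\t\n\r])\p{Cc}//g;

    # Assumption: runs of spaces left behind can be collapsed to one.
    $line =~ s/ {2,}/ /g;

    return $line;
}

binmode STDOUT, ':encoding(UTF-8)';
print clean_line("11AM\x{A0} LONDON"), "\n";
print clean_line("Dep\x{F3}sito Centralizado"), "\n";
```

Letters such as the accented o fall under \p{L} and are never touched by these substitutions, so there is no need to list them one by one.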