in reply to What's the 'M-' characters and how to filter/correct them?

Your data is not ASCII.

From the documentation for cat on my system:

-v Non-ASCII characters (with the high bit set) are printed as `M-' (for meta) followed by the character for the low 7 bits.

Edit: That's all, really. The below advice may not be needed as your only symptom seems to be that cat -v on your system doesn't display the characters correctly, as documented.

You should decode the data, or try adding at the top of the script:

binmode STDIN, ':utf8';

See Encode, as well as perlunitut and perluniintro.
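As a minimal sketch of what "decode the data" could look like: the snippet below assumes the input bytes are ISO-8859-1 (Latin-1) rather than UTF-8, which is only a guess about the OP's file. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the raw input is ISO-8859-1, where 0xF3 is "o with acute".
my $bytes = "Dep\xF3sito Centralizado";    # bytes as read from the file

# Turn the raw bytes into a Perl character string.
my $text = decode('ISO-8859-1', $bytes);

# Re-encode as UTF-8, e.g. for printing to a UTF-8 terminal.
my $utf8 = encode('UTF-8', $text);

print length($text), " characters, ", length($utf8), " UTF-8 bytes\n";
```

If the whole stream has one known encoding, a layer such as binmode STDIN, ':encoding(ISO-8859-1)'; does the decoding on read instead.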

update: added links to docs
The way forward always starts with a minimal test.

Re^2: What's the 'M-' characters and how to filter/correct them?
by shmem (Chancellor) on Jan 19, 2016 at 10:13 UTC
    binmode STDIN, ':utf8';

    See Encode, as well as perlunitut and perluniintro.

    But the data shown doesn't seem to be UTF-8 encoded Unicode. If it were, this

    DepM-ssito Centralizado

    would instead be

    DepM-CM-3sito Centralizado

    So, the data is some ISO-8859 variant. In ISO-8859 the ó is chr(243), which is chr(ord ('s') | 128) (hence the output as M-s) and the character with the high bit set in

    London andM- NewYork

    is most likely chr(160), i.e. a non-breaking space - chr(ord (' ') | 128).
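    To make the mapping concrete, here is a rough sketch that imitates the way cat -v renders bytes. The helper names cat_v_byte and cat_v are hypothetical, and this is an approximation of the documented behaviour, not the actual cat source.

```perl
use strict;
use warnings;

# Hypothetical helper: render one byte roughly the way `cat -v` does.
sub cat_v_byte {
    my ($byte) = @_;
    return $byte if $byte eq "\n" or $byte eq "\t";   # left untouched by -v
    my $n      = ord $byte;
    my $prefix = '';
    if ($n >= 128) {        # high bit set: emit "M-" and clear the bit
        $prefix = 'M-';
        $n -= 128;
    }
    return $prefix . '^?' if $n == 127;               # DEL
    return $prefix . '^' . chr($n + 64) if $n < 32;   # control characters
    return $prefix . chr($n);
}

# Apply the helper to every byte of a string.
sub cat_v { join '', map { cat_v_byte($_) } split //, $_[0] }

print cat_v("Dep\xF3sito"), "\n";             # Latin-1 bytes -> DepM-ssito
print cat_v("Dep\xC3\xB3sito"), "\n";         # UTF-8 bytes   -> DepM-CM-3sito
print cat_v("London and\xA0NewYork"), "\n";   # NBSP          -> London andM- NewYork
```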

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'

      Indeed. Thanks for clarifying that. I didn't mean to say the OP's data was UTF-8; it was just an example using a common encoding.

Re^2: What's the 'M-' characters and how to filter/correct them?
by sylph001 (Sexton) on Jan 19, 2016 at 09:23 UTC

    Thank you for the hint.

    On the other hand, do you know how I can filter out the 'M-' characters, if they turn out to be trivial?

      There are no "M-" characters as such. That's just how cat is displaying the non-ASCII characters. The ones that you say are "trivial" are probably some kind of white space character that you don't notice in the spreadsheet.

      Once you know what characters they are, for example as suggested in Re: What's the 'M-' characters and how to filter/correct them?, you can remove them with a regular expression. For example, here's a situation I dealt with recently involving invisible special characters that were causing problems with web browsers:

      # U+2028 ('Line Separator') and U+2029 ('Paragraph Separator') are valid JSON
      # but cause a parse error in the browser. So we remove them.
      $job_xml =~ s/\x{2028}|\x{2029}//sg;
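      For the "once you know what characters they are" step, one possible approach is to list every non-ASCII byte in a line together with its position. The helper name non_ascii_report is made up for this sketch, and it assumes the line is still a string of undecoded bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: report each non-ASCII byte with its offset and hex value.
sub non_ascii_report {
    my ($line) = @_;
    my @found;
    while ($line =~ /([^\x00-\x7F])/g) {
        # pos() points just past the match, so the byte itself is one back.
        push @found, sprintf 'offset %d: 0x%02X', pos($line) - 1, ord $1;
    }
    return @found;
}

print "$_\n" for non_ascii_report("11AM\xA0 LONDON");   # offset 4: 0xA0
```

      With the hex values in hand, it is easier to decide which characters can be dropped and which should be replaced by an ASCII equivalent.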


      update: showed recent example

        Thank you for the explanation.

        I think I'm able to get my script to recognize the non-ASCII characters in the pieces of data.

        However, when I try to remove or replace the non-ASCII characters using the regex, the result still shows some unexpected characters (wrapped in angle brackets) left in their positions. For example:

                    $line =~ s/[^:ascii]//g;
                    print $out_hdl "$line";

        Result:

        11AM<A0> LONDON

        Dep<F3>sito Centralizado

        This is not what I saw in the various examples on the internet.

        So, do you have any idea what is left there, and how it could be fully removed by this kind of regex?

         

        Thanks
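        For reference, a hedged sketch of the substitution: as displayed above, [^:ascii] is an ordinary bracketed class that excludes only the characters ':', 'a', 's', 'c' and 'i', whereas the POSIX class is spelled [[:ascii:]] inside the brackets. The sample line and the assumption that byte 0xA0 is an ISO-8859-1 non-breaking space are illustrative only.

```perl
use strict;
use warnings;

my $line = "11AM\xA0 LONDON";

# Assumption: the data is ISO-8859-1, so 0xA0 is a non-breaking space.
(my $spaced = $line) =~ s/\xA0/ /g;          # replace NBSP with a plain space

# Drop every non-ASCII character using the POSIX class.
(my $ascii = $line) =~ s/[^[:ascii:]]//g;

# Equivalent spelling with an explicit range instead of the POSIX class.
(my $ascii2 = $line) =~ s/[^\x00-\x7F]//g;

print "$ascii\n";
```

        Note that plain deletion turns "Dep\xF3sito" into "Depsito", so for accented letters a replacement table may be preferable to removal.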