sylph001 has asked for the wisdom of the Perl Monks concerning the following question:

Respected monks,


Recently I got an Excel spreadsheet and I extracted the data from it so that I can populate these data into a database.
However from the extracted data, I see some outstanding 'M-' characters.
It looks interesting that some of the 'M-' characters seems was interpreted from the visible special characters from the original spreadsheet, while the other of the 'M-' looks like trivail.
I paste each of the examples below:

'M-' that looks come from special characters
user@server> cat data1.txt
Depósito Centralizado
user@server> cat -v data1.txt
DepM-ssito Centralizado

'M-' that looks trivial
user@server> cat data2.txt
London and  NewYork
user@server> cat -v data2.txt
London andM-  NewYork

The spreadsheet was generated from Windows platform, and I'm extracting it on linux.
To make sure to filter out the usually mentioned "Windows Control Characters" like the '^M' and so on, I used dos2unix for each of the data files, but the 'M-' characters didn't disappear.

So, my monks, could you kindly let me know how do these 'M-' characters come from? Are they also the control characters?
Another question is, is it possible to use perl to clean-up/correct these 'M-' characters?

Many thanks in advance

  • Comment on What's the 'M-' characters and how to filter/correct them?

Replies are listed 'Best First'.
Re: What's the 'M-' characters and how to filter/correct them?
by shmem (Chancellor) on Jan 19, 2016 at 10:52 UTC

    So, my monks, could you kindly let me know how do these 'M-' characters come from? Are they also the control characters?

    Another question is, is it possible to use perl to clean-up/correct these 'M-' characters?

    These are characters with the 8th bit set, and the M-s is cats way to display a "ó" if you ask for it. The examples you provided are ISO-8859 (or some variant). If your database uses the same encoding, there's no need to clean up your data. If your databases uses UTF-8, the following should suffice

    use Encode qw(from_to); while (<>) { from_to($_,'latin1','utf-8'); print; }

    to convert your data to UTF-8.

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re: What's the 'M-' characters and how to filter/correct them?
by 1nickt (Canon) on Jan 19, 2016 at 09:14 UTC

    Your data is not ASCII.

    From the documentation for cat on my system:

    -v Non-ASCII characters (with the high bit set) are printed as `M-' (for meta) followed by the character for the low 7 bits.

    Edit: That's all, really. The below advice may not be needed as your only symptom seems to be that cat -v on your system doesn't display the characters correctly, as documented.

    You should decode the data, or try adding at the top of the script:

    binmode STDIN, ':utf8';

    See Encode, as well as perlunitut and perluniintro.

    update: added links to docs
    The way forward always starts with a minimal test.
      binmode STDIN, ':utf8';

      See Encode, as well as perlunitut and perluniintro.

      But the data shown doesn't seem to be unicode. If it was, this

      DepM-ssito Centralizado

      would instead be

      DepM-CM-3sito Centralizado

      So, the data is some ISO-8859 variant. In ISO-8859 the ó is chr(243), which is chr(ord ('s') | 128) (hence the output as M-s) and the character with the high bit set in

      London andM- NewYork

      is most likely chr(160), i.e. a non-breaking space - chr(ord (' ') | 128).

      perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'

        Indeed. Thanks for clarifying that. I didn't mean to say the OPs data was UTF-8; it was just an example using a common encoding.

        The way forward always starts with a minimal test.

      Thank you for the hint.

      On other hand, do you know how can I filter out the 'M-' characters, if they turn out to be trivial?

        There are no "M-" characters as such. That's just how cat is displaying the non-ASCII characters. The ones that you say are "trivial" are probably some kind of white space character that you don't notice in the spreadsheet.

        Once you know what characters they are, for example as suggested in Re: What's the 'M-' characters and how to filter/correct them?, you can remove them with a regular expression. For example, here's a situation I dealt with recently involving invisible special characters that were causing problems with web browsers:

        # U+2028 ('Line Separator') and U+2029 ('Paragraph Separator') + are valid JSON # but cause a parse error in the browser. So we remove them. $job_xml =~ s/\x{2028}|\x{2029}//sg;


        update: showed recent example
        The way forward always starts with a minimal test.
Re: What's the 'M-' characters and how to filter/correct them?
by Anonymous Monk on Jan 19, 2016 at 09:18 UTC
    Use Data::Dump or od -tacx1 or hexdump .... to see what bytes you're dealing with