We need to separate the problem into three parts:

  1. The source of your data, and the encoding there
  2. The Perl program and how the string is marked there
  3. The output of your data, and the encoding there

In your first program, you have a source file that is plain ASCII. In the program, you hand Perl two octets that together form the UTF-8 encoding of é. Perl has not been told that, so it treats the string as a "Latin-1" string with one character per byte and reports length 2. When printing your data, you don't tell Perl to do anything special, so Perl assumes you want Latin-1 as the output format, which means the string is written out unmodified. Your console expects UTF-8, and the two bytes that Perl outputs happen to be the UTF-8 encoding of é (e-acute), so the output looks right.
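To make the byte-level view concrete, here is a minimal sketch (not your original program; the variable names are only for illustration) that shows how Perl sees the undecoded string:

```perl
use strict;
use warnings;

# Two octets written as escapes; nothing has been decoded yet.
my $bytes = "\x{c3}\x{a9}";

# Perl counts one character per octet here.
my $length = length $bytes;

# Show each character's code point: these are the raw octet values.
my @code_points = map { sprintf 'U+%04X', ord } split //, $bytes;

print "length: $length\n";
print "characters: @code_points\n";
```

This should report a length of 2 and the two code points U+00C3 and U+00A9, which are the Latin-1 characters Ã and ©, not é.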

Here, adding binmode STDOUT, ':encoding(UTF-8)'; tells Perl that you want UTF-8 on output, and using my $string= decode('UTF-8', "\x{c3}\x{a9}"); tells Perl to interpret the bytes of the string as UTF-8. Together, these should change the program to do what you want.
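Put together, the two changes might look like this. It is a sketch that assumes the rest of your first program only takes the length of the string and prints it:

```perl
use strict;
use warnings;
use Encode 'decode';

# Encode characters as UTF-8 whenever we write to STDOUT.
binmode STDOUT, ':encoding(UTF-8)';

# Interpret the two octets as UTF-8, which yields one character (U+00E9).
my $string = decode('UTF-8', "\x{c3}\x{a9}");

print length($string), "\n";
print $string, "\n";
```

The string now has length 1, and the output layer turns that one character back into the two UTF-8 bytes your console expects.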

In your second program, you have a source file that is UTF-8. In the program, you hand Perl two octets that represent the UTF-8 encoding, and you tell Perl that the program source is UTF-8. So Perl decodes the two bytes into a single character and thinks this string should have length 1, because it is now a "UTF-8" (Unicode) string. When printing your data, you don't tell Perl to do anything special, so Perl assumes you want Latin-1 as the output format and converts your string to Latin-1 when printing it. Your console expects UTF-8, and the single byte that Perl outputs happens to be an invalid UTF-8 sequence.

Here, you only need to tell Perl that you want UTF-8 on output by using binmode on STDOUT.
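As a sketch, and assuming your second program uses the utf8 pragma with a literal é in a source file saved as UTF-8 (I am guessing at the details), the fix is a single added line:

```perl
use strict;
use warnings;
use utf8;   # assumption: this source file is saved as UTF-8

# The only change: encode characters as UTF-8 on output.
binmode STDOUT, ':encoding(UTF-8)';

my $string = "é";   # decoded by "use utf8" into one character, U+00E9

print length($string), "\n";
print $string, "\n";
```

Without the binmode line, Perl would write the single Latin-1 byte 0xE9, which a UTF-8 console cannot display.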

The two modules you use expect Unicode input, but you hand them byte sequences. You want to use Encode::decode to decode them to real Unicode strings:

    use Encode 'decode';
    my $string= decode 'UTF-8', "\x{c3}\x{a9}";
    ...

In reply to Re: Problems handling UTF8 ! And removing accents. by Corion
in thread Problems handling UTF8 ! And removing accents. by prunkdump
