in reply to Re^5: unicode in perl
in thread unicode in perl

The byte contents are the same: $VAR1 = "\204qy\204{z\204\267{\204\217y\204\243\177\204\217~"; but extra characters are still displayed in the text file: qy{z{�y�~ instead of qy{z{y~

Re^7: unicode in perl
by ikegami (Patriarch) on Jun 27, 2011 at 17:44 UTC

    The problem is that byte 8F is not defined in cp1252, so what you have isn't (valid) cp1252.

    Byte cp1252 Want                               Ok?
    ---- ------ ---------------------------------- ---
    84   U+201E U+201E DOUBLE LOW-9 QUOTATION MARK yes
    71   U+0071 U+0071 LATIN SMALL LETTER Q        yes
    79   U+0079 U+0079 LATIN SMALL LETTER Y        yes
    84   U+201E U+201E DOUBLE LOW-9 QUOTATION MARK yes
    7B   U+007B U+007B LEFT CURLY BRACKET          yes
    7A   U+007A U+007A LATIN SMALL LETTER Z        yes
    84   U+201E U+201E DOUBLE LOW-9 QUOTATION MARK yes
    B7   U+00B7 U+00B7 MIDDLE DOT                  yes
    7B   U+007B U+007B LEFT CURLY BRACKET          yes
    84   U+201E U+201E DOUBLE LOW-9 QUOTATION MARK yes
    8F   ------ U+008F SINGLE SHIFT THREE          NO!
    79   U+0079 U+0079 LATIN SMALL LETTER Y        yes
    84   U+201E U+201E DOUBLE LOW-9 QUOTATION MARK yes
    A3   U+00A3 U+00A3 POUND SIGN                  yes
    7F   U+007F U+007F DELETE                      yes
    84   U+201E U+201E DOUBLE LOW-9 QUOTATION MARK yes
    8F   ------ U+008F SINGLE SHIFT THREE          NO!
    7E   U+007E U+007E TILDE                       yes
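
    To check a single byte yourself, something like the following sketch may help. It asks Encode to decode one byte strictly (Encode::FB_CROAK) and reports whether that worked. The helper name byte_is_defined is made up for illustration and is not part of Encode; with Encode's cp1252 table it should report 84 as defined and 8F as not defined.

    ```perl
    use strict;
    use warnings;

    use Encode qw( decode );

    # Hypothetical helper (not from the post above): returns 1 if Encode can
    # map the single byte $byte to a character in $encoding, 0 otherwise.
    sub byte_is_defined {
       my ($encoding, $byte) = @_;

       # Work on a fresh scalar: decode() may modify its input when CHECK is set.
       my $octets = chr($byte);

       # FB_CROAK makes decode() die on a byte the encoding does not define.
       return eval { decode($encoding, $octets, Encode::FB_CROAK); 1 } ? 1 : 0;
    }

    for my $byte (0x84, 0x8F) {
       printf "%02X %s\n", $byte,
          byte_is_defined('cp1252', $byte) ? 'defined' : 'not defined';
    }
    ```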

    It doesn't appear to be any other encoding either.

    use strict;
    use warnings;
    use feature qw( say );

    use Encode qw( decode encode );

    my $have = "\x84\x71\x79\x84\x7B\x7A\x84\xB7"
             . "\x7B\x84\x8F\x79\x84\xA3\x7F\x84"
             . "\x8F\x7E";

    my $want = "\x{201E}\x{0071}\x{0079}\x{201E}"
             . "\x{007B}\x{007A}\x{201E}\x{00B7}"
             . "\x{007B}\x{201E}\x{008F}\x{0079}"
             . "\x{201E}\x{00A3}\x{007F}\x{201E}"
             . "\x{008F}\x{007E}";

    for (Encode->encodings(':all')) {
       my $got;
       if (!eval { $got = decode($_, $have); 1 }) {
          warn $@;
          next;
       }

       say if $got eq $want;
    }
    -- empty output except for bad data errors --

    What your editor appears to be doing is treating the bytes as cp1252, and treating undefined bytes as the Unicode character with the same codepoint.

    use strict;
    use warnings;
    use feature qw( say );

    use Encode qw( decode encode_utf8 );

    my $have = "\x84\x71\x79\x84\x7B\x7A\x84\xB7"
             . "\x7B\x84\x8F\x79\x84\xA3\x7F\x84"
             . "\x8F\x7E";

    my $want = "\x{201E}\x{0071}\x{0079}\x{201E}"
             . "\x{007B}\x{007A}\x{201E}\x{00B7}"
             . "\x{007B}\x{201E}\x{008F}\x{0079}"
             . "\x{201E}\x{00A3}\x{007F}\x{201E}"
             . "\x{008F}\x{007E}";

    my $got = decode('cp1252', $have, sub { encode_utf8(chr($_[0])) });
    say "match" if $got eq $want;
    match
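
    If you'd rather not rely on a coderef for CHECK, here is a rough sketch of the same idea done one byte at a time. That is safe here only because cp1252 is a single-byte encoding. The helper name decode_cp1252_lenient is hypothetical (it is not an Encode function), and this is an alternative to the call above, not a replacement for it.

    ```perl
    use strict;
    use warnings;

    use Encode qw( decode );

    # Hypothetical helper (not from the post above): decode cp1252 one byte at
    # a time. Bytes that cp1252 leaves undefined become the Unicode character
    # with the same code point, which is what the editor appears to do.
    sub decode_cp1252_lenient {
       my ($octets) = @_;

       my $out = '';
       for my $octet (split //, $octets) {
          # decode() may modify its input when CHECK is set, so use a copy.
          my $copy = $octet;

          # FB_CROAK makes decode() die on an undefined byte; eval then
          # returns undef and we fall back to the same-numbered code point.
          my $char = eval { decode('cp1252', $copy, Encode::FB_CROAK) };
          $out .= defined($char) ? $char : chr(ord($octet));
       }

       return $out;
    }
    ```

    The result is a string of characters, so it still has to be encoded when you print it, for example by opening the output handle with '>:encoding(UTF-8)'.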
      How can I solve this issue?
        Huh? I presented a working solution in the post to which you replied.
        What issue?