The problem is that byte 8F is not defined in cp1252, so what you have isn't (valid) cp1252.

Byte cp1252   Want                                Ok?
---- ------   ----------------------------------  ---
84   U+201E   U+201E DOUBLE LOW-9 QUOTATION MARK  yes
71   U+0071   U+0071 LATIN SMALL LETTER Q         yes
79   U+0079   U+0079 LATIN SMALL LETTER Y         yes
84   U+201E   U+201E DOUBLE LOW-9 QUOTATION MARK  yes
7B   U+007B   U+007B LEFT CURLY BRACKET           yes
7A   U+007A   U+007A LATIN SMALL LETTER Z         yes
84   U+201E   U+201E DOUBLE LOW-9 QUOTATION MARK  yes
B7   U+00B7   U+00B7 MIDDLE DOT                   yes
7B   U+007B   U+007B LEFT CURLY BRACKET           yes
84   U+201E   U+201E DOUBLE LOW-9 QUOTATION MARK  yes
8F   ------   U+008F SINGLE SHIFT THREE           NO!
79   U+0079   U+0079 LATIN SMALL LETTER Y         yes
84   U+201E   U+201E DOUBLE LOW-9 QUOTATION MARK  yes
A3   U+00A3   U+00A3 POUND SIGN                   yes
7F   U+007F   U+007F DELETE                       yes
84   U+201E   U+201E DOUBLE LOW-9 QUOTATION MARK  yes
8F   ------   U+008F SINGLE SHIFT THREE           NO!
7E   U+007E   U+007E TILDE                        yes
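
To see which bytes cp1252 leaves undefined, one option is to decode each byte on its own and note which ones fail. This is only a sketch: the helper name undefined_bytes is made up for illustration, and the result depends on the cp1252 table shipped with your version of Encode. On the table used above, 8F should be among the bytes it reports.

```perl
use strict;
use warnings;
use feature qw( say );

use Encode qw( decode );

# Hypothetical helper: returns the byte values the named encoding cannot decode.
sub undefined_bytes {
    my ($encoding) = @_;
    my @undefined;
    for my $byte (0x00 .. 0xFF) {
        # Use a fresh copy each time; decode may modify its input when CHECK is set.
        my $octets = chr($byte);
        # FB_CROAK makes decode die instead of substituting U+FFFD.
        push @undefined, $byte
            if !eval { decode($encoding, $octets, Encode::FB_CROAK()); 1 };
    }
    return @undefined;
}

say join ' ', map { sprintf '%02X', $_ } undefined_bytes('cp1252');
```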

It doesn't appear to be any other encoding known to Encode either.

use strict;
use warnings;
use feature qw( say );

use Encode qw( decode );

my $have = "\x84\x71\x79\x84\x7B\x7A\x84\xB7"
         . "\x7B\x84\x8F\x79\x84\xA3\x7F\x84"
         . "\x8F\x7E";
my $want = "\x{201E}\x{0071}\x{0079}\x{201E}"
         . "\x{007B}\x{007A}\x{201E}\x{00B7}"
         . "\x{007B}\x{201E}\x{008F}\x{0079}"
         . "\x{201E}\x{00A3}\x{007F}\x{201E}"
         . "\x{008F}\x{007E}";

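# Try every encoding Encode knows about, and print the name of each
# one that decodes $have to exactly $want.
for (Encode->encodings(':all')) {
    my $got;
    if (!eval { $got = decode($_, $have); 1 }) {
        warn $@;  # Some encodings reject the data outright.
        next;
    }

    say if $got eq $want;
}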

-- empty output except for bad data errors --

What your editor appears to be doing is decoding the bytes as cp1252, but mapping each byte that cp1252 leaves undefined to the Unicode character with the same code point (so byte 8F becomes U+008F).

use strict;
use warnings;
use feature qw( say );

use Encode qw( decode encode_utf8 );

my $have = "\x84\x71\x79\x84\x7B\x7A\x84\xB7"
         . "\x7B\x84\x8F\x79\x84\xA3\x7F\x84"
         . "\x8F\x7E";
my $want = "\x{201E}\x{0071}\x{0079}\x{201E}"
         . "\x{007B}\x{007A}\x{201E}\x{00B7}"
         . "\x{007B}\x{201E}\x{008F}\x{0079}"
         . "\x{201E}\x{00A3}\x{007F}\x{201E}"
         . "\x{008F}\x{007E}";

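# The fallback is called with the ordinal of each byte cp1252 can't map.
# It must return octets, so return the UTF-8 encoding of the character
# with that code point.
my $got = decode('cp1252', $have, sub { encode_utf8(chr($_[0])) });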
say "match" if $got eq $want;

match

In reply to Re^7: unicode in perl by ikegami
in thread unicode in perl by paramjit
