
jbert has provided a good link ;-) From a quick glance, I guess Notepad's "Unicode" means UTF-16LE.

--shmem

_($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                              /\_¯/(q    /
----------------------------  \__(m.====·.(_("always off the crowd"))."·
");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}

Re^4: Unicode2ascii
by ikegami (Patriarch) on Nov 28, 2006 at 14:51 UTC
    It's UCS-2LE, the fixed-width variant of UTF-16LE.
    use strict;
    use warnings;

    my $file_in  = '...';
    my $file_out = '...';

    open(my $fh_in, '<:raw:encoding(UCS-2LE)', $file_in)
        or die("Unable to open \"$file_in\": $!\n");
    open(my $fh_out, '>:raw:encoding(UCS-2LE)', $file_out)
        or die("Unable to create file \"$file_out\": $!\n");

    while (<$fh_in>) {
        ...
        print $fh_out $_;
    }

    Update: Oops, I had originally confirmed that it was UTF-16LE.

      Glad you corrected this - I'm not that proficient on Windows ;-)

      I wonder about the leading sequence 0xff 0xfe in notepad saved text files - is that some marker indicating the encoding type?

      --shmem


        Glad you corrected this - I'm not that proficient on Windows ;-)

        UCS-2 and UTF-16 are practically identical. The former is fixed width (like iso-latin-1) and the latter is variable width (like UTF-8). The pros and cons for using UTF-8 over iso-latin-1 also apply to using UTF-16 over UCS-2.
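To make the fixed-width versus variable-width point concrete, here is a minimal sketch using the core Encode module. The helper name `hex_bytes` and the two sample characters are my own choices for illustration, not something from the posts above. It assumes Encode refuses code points above U+FFFF for UCS-2LE when asked to croak on errors, which is why that case is reported as "not representable" rather than as specific bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of $str encoded as $enc, or undef if Encode
# refuses to encode it. FB_CROAK makes failures fatal; eval catches them.
sub hex_bytes {
    my ($enc, $str) = @_;    # $str is a copy; encode() with a CHECK value may modify it
    my $bytes = eval { encode($enc, $str, Encode::FB_CROAK) };
    return defined $bytes ? unpack('H*', $bytes) : undef;
}

# U+00E9 lies inside the Basic Multilingual Plane (BMP);
# U+1D11E (musical G clef) lies outside it.
for my $char ("\x{E9}", "\x{1D11E}") {
    for my $enc ('UCS-2LE', 'UTF-16LE') {
        my $hex = hex_bytes($enc, $char);
        printf "U+%04X as %-8s: %s\n", ord($char), $enc,
            defined $hex ? $hex : 'not representable';
    }
}
```

For the BMP character both encodings give the same two bytes (`e900`). For U+1D11E, UTF-16LE produces the four-byte surrogate pair `34d81edd`, while UCS-2LE has no way to express it. That is the sense in which the two are "practically identical": they only differ for characters outside the BMP.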

        Windows uses UCS-2LE. Not knowing anything about UCS-2 'til today, I've been blindly using UTF-16LE.

        I wonder about the leading sequence 0xff 0xfe in notepad saved text files - is that some marker indicating the encoding type?

        It's a Byte Order Mark (BOM).

        I wonder about the leading sequence 0xff 0xfe in notepad saved text files - is that some marker indicating the encoding type?

        It's called the "byte order mark" (BOM), and is used to detect whether the data in the file is little-endian or big-endian.

        See BOM.
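As a rough illustration of how the BOM reveals the byte order, the sketch below inspects the first bytes of a buffer. The function name `guess_bom` and the sample strings are hypothetical. It only covers the UTF-8 and UTF-16 marks, and it ignores UTF-32 (whose little-endian BOM also begins with 0xFF 0xFE).

```perl
use strict;
use warnings;

# Hypothetical helper: map the leading bytes of a raw (undecoded) buffer
# to an encoding name, or return undef if no known BOM is present.
# UTF-32 is deliberately not handled in this sketch.
sub guess_bom {
    my ($head) = @_;
    return 'UTF-8'    if $head =~ /^\xEF\xBB\xBF/;
    return 'UTF-16LE' if $head =~ /^\xFF\xFE/;    # what Notepad writes for "Unicode"
    return 'UTF-16BE' if $head =~ /^\xFE\xFF/;
    return;
}

# In-memory samples standing in for the first bytes of a file.
my @samples = (
    "\xFF\xFEh\x00i\x00",    # BOM + "hi", little-endian
    "\xFE\xFF\x00h\x00i",    # BOM + "hi", big-endian
    "hi",                    # no BOM at all
);

for my $raw (@samples) {
    my $enc = guess_bom($raw);
    print defined $enc ? $enc : 'no BOM', "\n";
}
# prints:
# UTF-16LE
# UTF-16BE
# no BOM
```

In practice you may not need to do this by hand: opening a file with the `:encoding(UTF-16)` layer (no LE/BE suffix) lets Encode read the BOM and pick the byte order itself.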


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.