I switched from "open(FILE, "<:encoding(UTF-8)", $file)" to using decode() on each line in the while(), but now I get "UTF-16:Unrecognised BOM 30 at C:/Perl/lib/Encode.pm line 162" on Line 2 of the file. Line 2 is way before the suspicious char. Any thoughts?

I also tried using UCS-2, but I get "illegal unicoded char", which sounds like a legitimate complaint for an encode/decode mismatch.

Next, I'll try the suggested success/fail code, but I don't quite understand it. Does it try multiple encodings?

Thanks for all of the help!

Re^3: Decoding bad UTF-16
by moritz (Cardinal) on Sep 25, 2008 at 21:48 UTC
    Why did you have an open(FILE, "<:encoding(UTF-8)", $file) if your file is in UTF-16?

    Your seemingly random trials of various character encodings (UTF-8, UTF-16 (which one? LE?) and UCS-2) make me think that what you really need is to find out which character encoding your file is in. The best way is to read the documentation of the program that created it. Guessing character encodings is bound to fail, especially when there are multiple similar ones.

Re^3: Decoding bad UTF-16
by ikegami (Patriarch) on Sep 25, 2008 at 23:08 UTC

    I switched from "open(FILE, "<:encoding(UTF-8)", $file)" to using decode()

    eh? UTF-8?

    decode and <:encoding are the same thing.

    UTF-16:Unrecognised BOM

    When you specify UTF-16, the file must have a BOM. Specify the actual encoding (UTF-16le or UTF-16be) otherwise.
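A minimal sketch of the BOM rule, assuming Encode's stock UTF-16 and UTF-16LE encodings (the variable names are made up for illustration). It also suggests why decoding line by line fails on line 2: only the first chunk of the file starts with a BOM.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $text = "hi\n";

# The same UTF-16LE bytes, once without and once with a little-endian BOM.
my $le_no_bom   = encode('UTF-16LE', $text);
my $le_with_bom = "\xFF\xFE" . $le_no_bom;

# Plain "UTF-16" reads the BOM to pick the byte order, then strips it.
my $from_bom = decode('UTF-16', $le_with_bom);

# Without a BOM, plain "UTF-16" has nothing to go on and croaks.
my $bom_error = eval { decode('UTF-16', $le_no_bom); 1 } ? '' : $@;

# Naming the byte order explicitly works without a BOM.
my $from_explicit = decode('UTF-16LE', $le_no_bom);

print "BOM error: $bom_error" if $bom_error;
```

Here `$from_bom` and `$from_explicit` should both hold the original text, while `$bom_error` holds the "Unrecognised BOM" message.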

    I also tried using UCS-2, but I get "illegal unicoded char",

    That's not possible. I've just shown you that every possible byte combination is accepted by decode.

    What bytes cause that, and what encoding did you specify, UCS-2le or UCS-2be?

Re^3: Decoding bad UTF-16
by ikegami (Patriarch) on Sep 25, 2008 at 23:09 UTC

    Next, I'll try the suggested success/fail code, but I don't quite understand it.

    It demonstrates that all byte combinations work with UCS-2, and since UCS-2 is a very close relative of UTF-16, you'll get further by using that. It's probably what Word uses anyway, since Windows likes to lie about using UTF-16.
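For comparison, here is a rough sketch of how decode() itself treats a malformed surrogate under UTF-16LE, depending on the CHECK argument. The input bytes are invented for illustration. My understanding is that the `:encoding` layer passes a stricter CHECK than decode()'s default, which may explain why one dies where the other does not.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "A", a lone low surrogate (U+DC00), then "B", all as UTF-16LE bytes.
my $bytes = "A\0" . "\x00\xDC" . "B\0";

# Default CHECK: the bad code unit is replaced with U+FFFD, nothing dies.
my $lenient = decode('UTF-16LE', $bytes);

# FB_CROAK: the same input raises an exception, which eval can trap.
my $copy = $bytes;    # decode may modify its argument when CHECK is set
my $strict_ok = eval { decode('UTF-16LE', $copy, Encode::FB_CROAK()); 1 };
my $strict_error = $strict_ok ? '' : $@;

printf "lenient result has %d characters\n", length $lenient;
print "strict decode failed: $strict_error" unless $strict_ok;
```

Wrapping the strict call in eval is one way to skip a bad record without letting the whole program die.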

      Sorry, UTF-8 was a typo.

      I've submitted requests, so far unanswered, to technical support for the software that generates the logs I'm reading, asking them to tell me the encoding. I may never know it, but that doesn't mean that I get to give up.

      Using <:encoding(UTF-16) has worked nicely for a couple of months, then suddenly I started having problems (i.e., malformed HI surrogate). I don't care if I have to skip one record, I just don't want Perl to die.

      I switched to decode(UTF-16); however, while Perl doesn't die, it now behaves differently. The output seems to have a space between every char, which I think implies that the encoding is wrong, but then why does it work with <:encoding?

      Here are my two programs that I thought would be the same:

      my $file = shift;
      my $enc = "UTF-16";
      open(FILE, "<:encoding($enc)", $file) || die("Can't: $!");
      while ( <FILE> ) {
          print;
      }
      close(FILE);

      my $file = shift;
      my $enc = "UTF-16";
      open(FILE,$file) || die("Can't: $!");
      while ( <FILE> ) {
          my $str = decode($enc,$_);
          print encode($enc,$str);
      }
      close(FILE);

      The first technique worked well for a couple of months, but now I'm getting some new chars on which it dies. The second one doesn't die, but I can't regex any of the text. Any other thoughts?

      I appreciate your help thus far. My knowledge is obviously limited with regards to encoding. Any help is much appreciated. Thanks!

        The first program is buggy. You decode without ever encoding. You'd get a "wide character" warning for some inputs if you had warnings on.

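        The second program is also buggy, but for a different reason. You presume a line ends at byte 0x0A, but that's not true: in UTF-16 a line feed is two bytes (00 0A or 0A 00, depending on byte order), and a 0x0A byte can also be half of some other character.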
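To illustrate the line-ending problem, here is a small sketch that uses split to mimic what a byte-oriented readline would do with UTF-16LE data. The sample text is invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = encode('UTF-16LE', "ab\ncd\n");

# Splitting after every 0x0A byte cuts each line feed in half:
# the trailing 00 of "0A 00" lands at the start of the next chunk.
my @raw_lines = split /(?<=\n)/, $bytes;

# Decoding first and splitting afterwards keeps characters intact.
my @lines = split /(?<=\n)/, decode('UTF-16LE', $bytes);

printf "raw chunk lengths: %s\n", join ', ', map { length } @raw_lines;
printf "decoded lines: %d\n", scalar @lines;
```

The first raw chunk has an odd number of bytes, so every later chunk is misaligned by one byte and cannot be decoded correctly on its own.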

        And they output differently. The first outputs a mix of iso-latin-1 and UTF-8. The second program outputs UTF-16le or UTF-16be, probably the latter.

        use strict;
        use warnings;
        use open ':std', ':locale';
        my $file = shift;
        my $enc = "UTF-16";
        open(my $fh, "<:encoding($enc)", $file) || die("Can't: $!");
        while ( <$fh> ) {
            print;
        }

        use strict;
        use warnings;
        use open ':std', ':locale';
        use Encode qw( decode );
        my $file = shift;
        my $enc = "UTF-16";
        # Slurp the whole file as raw bytes, then decode it in one go.
        my $raw = do {
            open(my $fh, "<:raw", $file) || die("Can't: $!");
            local $/;
            <$fh>
        };
        my $str = decode($enc, $raw);
        print $str;

        The output seems to have a space between every char

        That's usually a sign of UTF-16le/UTF-16be/UCS-2le/UCS-2be being treated as ASCII or a derivative like iso-latin-1 or UTF-8.
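A quick sketch of where the "spaces" come from, assuming the data is UTF-16LE and is being read as iso-latin-1. The substitution of NUL with a space only imitates what many terminals display.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "abc" in UTF-16LE is a\0 b\0 c\0: every ASCII char is followed by a NUL.
my $bytes = encode('UTF-16LE', "abc");

# Treating those bytes as iso-latin-1 keeps the NULs as real characters.
my $misread = decode('iso-8859-1', $bytes);

# Many terminals render NUL as a blank, hence the apparent spacing.
(my $visible = $misread) =~ s/\0/ /g;

# Decoding with the right encoding gives the text back.
my $correct = decode('UTF-16LE', $bytes);

print "misread: '$visible'\n";
print "correct: '$correct'\n";
```

The stray NULs are also why a regex written against plain text will not match the misread string.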