Decoding bad UTF-16

gregality has asked for the wisdom of the Perl Monks concerning the following question:

Re: Decoding bad UTF-16 by moritz (Cardinal) on Sep 25, 2008 at 18:25 UTC
Use `Encode::decode` to decode your text. The third argument to `decode()` determines what happens when a malformed character is encountered.
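As a minimal sketch of that third argument, assuming the data is UTF-16LE: passing `Encode::FB_CROAK` asks `decode()` to die on malformed input, which can then be trapped with `eval`. The helper name `decode_strict` and the sample byte strings are made up for illustration and are not from the original post.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: strictly decode UTF-16LE octets.
# Returns the decoded string, or undef if the input is malformed.
sub decode_strict {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on malformed input.
    # LEAVE_SRC stops decode() from modifying $octets in place.
    my $text = eval {
        decode('UTF-16LE', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text;    # undef when decode() died
}

my $good = pack('v*', 0x68, 0x69);      # "hi"
my $bad  = pack('v*', 0x68, 0xDC00);    # "h", then a lone low surrogate

for my $sample ([ good => $good ], [ bad => $bad ]) {
    my ($name, $octets) = @$sample;
    printf "%-5s %s\n", "$name:",
        defined decode_strict($octets) ? 'decoded' : 'malformed';
}
```

Other `CHECK` values are described in the Encode documentation. What happens when no `CHECK` value is given can differ between encodings and Encode versions, so this sketch only exercises the strict mode.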
Re: Decoding bad UTF-16 by ikegami (Patriarch) on Sep 25, 2008 at 20:31 UTC
It's probably UCS-2, which Windows often calls UTF-16. Even if it's not, decoding as UCS-2 is a convenient way of avoiding the problem.

`use strict; use warnings; use Encode qw( decode ); my $s = ''; $s .= pack('v', $_) for 0..65535; for my $enc (qw( UCS-2le UTF-16le )) { printf( "%-9s %s\n", "$enc:", eval { decode($enc, $s); 'success' } || 'error' ); }`

Output: `UCS-2le: success`, `UTF-16le: error`
Re^2: Decoding bad UTF-16 by gregality (Initiate) on Sep 25, 2008 at 21:18 UTC
I switched from "open(FILE, "<:encoding(UTF-8)", $file)" to using decode() on each line in the while(), but now I get "UTF-16:Unrecognised BOM 30 at C:/Perl/lib/Encode.pm line 162" on line 2 of the file. Line 2 is way before the suspicious char. Any thoughts? I also tried using USC-2, but I get "illegal unicoded char", which sounds like a legitimate complaint for an encode/decode mismatch. Next, I'll try the suggested success/fail code, but I don't quite understand it. Does it try multiple encodings? Thanks for all of the help!
Re^3: Decoding bad UTF-16 by moritz (Cardinal) on Sep 25, 2008 at 21:48 UTC
Why did you have an `open(FILE, "<:encoding(UTF-8)", $file)` if your file is in UTF-16? Your seemingly random trials of various character encodings (UTF-8, UTF-16 (which one? LE?) and UCS-2) make me think that what you really need is to find out which character encoding your file is in. The best way is to read the documentation of the program that created it. Guessing character encodings is bound to fail, especially when there are multiple similar ones.
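When the producing program's documentation is not at hand, one thing that can be checked rather than guessed is whether the file starts with a byte-order mark. The sketch below is an illustration, not part of the original reply: `sniff_bom` is a hypothetical helper, it ignores UTF-32 (whose little-endian BOM begins with the same two bytes as the UTF-16LE one), and a missing BOM tells you nothing about the encoding.

```perl
use strict;
use warnings;

# Hypothetical helper: name the encoding implied by a leading BOM.
# Returns undef when the octets do not start with a known BOM.
sub sniff_bom {
    my ($octets) = @_;
    return 'UTF-8'    if $octets =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE' if $octets =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $octets =~ /\A\xFE\xFF/;
    return;
}

# Sample inputs; for a real file, open it with '<:raw' and read the
# first few bytes instead.
my %samples = (
    'LE with BOM' => "\xFF\xFEh\0i\0",
    'BE with BOM' => "\xFE\xFF\0h\0i",
    'no BOM'      => "h\0i\0",
);

for my $name (sort keys %samples) {
    my $enc = sniff_bom($samples{$name});
    printf "%-12s %s\n", "$name:", defined $enc ? $enc : 'unknown';
}
```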
Re^3: Decoding bad UTF-16 by ikegami (Patriarch) on Sep 25, 2008 at 23:09 UTC
> I switched from "open(FILE, "<:encoding(UTF-8)", $file)" to using decode()

Eh? `UTF-8`? `decode` and `<:encoding` do the same thing.

> UTF-16:Unrecognised BOM

When you specify `UTF-16`, the file must have a BOM. Otherwise, specify the actual encoding (`UTF-16le` or `UTF-16be`).

> I also tried using USC-2, but I get "illegal unicoded char",

That's not possible. I've just shown you that every possible byte combination is accepted by `decode`. What bytes cause that, and which encoding did you specify, `UCS-2le` or `UCS-2be`?

> Next, I'll try the suggested success/fail code, but I don't quite understand it.

It demonstrates that all byte combinations work with UCS-2, and since UCS-2 is a very close relative of UTF-16, you'll get further by using that. It's probably what Word uses anyway, since Windows likes to lie about using `UTF-16`.
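The BOM rule can be illustrated with a short sketch. The sample bytes are invented for the example. How plain `UTF-16` treats input without a BOM appears to depend on the Encode version (the version in this thread reports "Unrecognised BOM"), so that case is deliberately left out.

```perl
use strict;
use warnings;
use Encode qw( decode );

my $payload  = pack('v*', 0x68, 0x69);    # "hi" as UTF-16LE code units, no BOM
my $with_bom = "\xFF\xFE" . $payload;     # same text after a little-endian BOM

# 'UTF-16' reads the BOM to pick the byte order, then drops it.
my $from_bom = decode('UTF-16', $with_bom);

# 'UTF-16LE' states the byte order up front, so no BOM is needed.
my $explicit = decode('UTF-16LE', $payload);

print "from BOM: $from_bom\n";
print "explicit: $explicit\n";
```

This also suggests why decoding a file line by line as `UTF-16` is fragile: only the first line can begin with the BOM, so later lines have to be decoded with an explicit byte order.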
Re^4: Decoding bad UTF-16 by gregality (Initiate) on Sep 29, 2008 at 20:39 UTC
Re^5: Decoding bad UTF-16 by ikegami (Patriarch) on Sep 29, 2008 at 21:07 UTC