Problems Handling UCS-2LE

Ecurb has asked for the wisdom of the Perl Monks concerning the following question:

I'm a novice at Perl.

I ran into a problem reading an idrive.com log file on a Windows XP computer.

My solution looks like:

my $logopen = open LOGFILE,"<:encoding(UCS-2LE)", $file;
seek LOGFILE, 2, 0;
local $/ = undef;
my $lines = <LOGFILE>;
print '<li>';
if ( $lines =~ /Backup Completed/ )
{
  print "IDrive Backup Completed: ";
}
else
{
  print "<b>Backup FAILED:</b> ";
}
print $file . "</li>\n";

According to Notepad++, the IDrive log file was UCS-2 LE. Perl would not slurp the file into a string until I explicitly used ":encoding(UCS-2LE)" and skipped past the first two bytes of the file. If I did not skip the first two bytes, I would get a "wide character" warning and "$lines = <LOGFILE>" would only capture a few characters out of a 1088 character file.

I'm still wondering:
1. If Perl should have handled the UCS-2LE file without needing to include the encoding or the skipping of bytes
2. If the IDrive log files might be a non-standard or corrupted UCS-2LE

Bruce


Re: Problems Handling UCS-2LE by ikegami (Patriarch) on May 21, 2009 at 18:35 UTC
The bytes you are skipping form the character U+FEFF, the byte-order mark (BOM). Use `UCS-2` instead of `UCS-2le` and it will skip the character for you.

The wide character warning is issued because you are outputting decoded characters without encoding them (loosely speaking). Fix:

# Encode output.
# Use the encoding that's appropriate for you.
binmode STDOUT, ':encoding(UTF-8)';

my $lines;
{
    # Decode input.
    open my $log_fh, "<:encoding(UCS-2)", $file
        or die($!);
    local $/ = undef;
    $lines = <$log_fh>;
}

print "...\n", $lines, "...\n";

On unix, you can do `use open ':std', ':locale';` to set the "correct" encoding for STDOUT, but it doesn't work on Windows :(

> If I did not skip the first two bytes, [...] "`$lines = <LOGFILE>`" would only capture a few characters out of a 1088 character file.

You are mistaken.

> If Perl should have handled the UCS-2LE file without needing to include the encoding or the skipping of bytes

Perl has no way of knowing the encoding of a file, or even if it's a text file for that matter.

> If the IDrive log files might be a non-standard or corrupted UCS-2LE

Why do you ask that? There are some byte combinations that aren't allowed in UCS-2*. Encountering them is fatal.

$ perl -e'open $fh, "<:encoding(UCS-2le)", \"\x00\xD8"; <$fh>'
UCS-2LE:no surrogates allowed d800 at -e line 1.
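To see what the two skipped bytes actually are, here is a minimal sketch using the core Encode module. The sample byte string is hypothetical (a BOM followed by "OK"), not taken from the real log. It compares a fixed-byte-order decoder, which keeps the BOM as an ordinary character, with the BOM-aware `UTF-16` decoder, which uses the BOM to pick the byte order and then drops it.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: BOM (FF FE) followed by "OK" as little-endian 16-bit units.
my $bytes = "\xFF\xFE" . "O\x00K\x00";

# A fixed-endianness decoder treats the BOM as just another character (U+FEFF).
my $with_bom = decode('UTF-16LE', $bytes);

# The BOM-aware decoder reads the BOM to choose the byte order, then discards it.
my $without_bom = decode('UTF-16', $bytes);

printf "UTF-16LE: %d chars, first is U+%04X\n",
    length($with_bom), ord($with_bom);
printf "UTF-16:   %d chars, text is %s\n",
    length($without_bom), $without_bom;
```

The first line should report three characters starting with U+FEFF, and the second only the two characters of the text itself.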
Re^2: Problems Handling UCS-2LE by Ecurb (Initiate) on May 23, 2009 at 19:18 UTC
I tried:

my $logopen = open LOGFILE,"<:encoding(UCS-2)", $file;
local $/ = undef;
my $lines = <LOGFILE>;

The last line generates the error message: "UCS-2BE:Unicode character fffe is illegal"

Thanks for any other thoughts on what I might be doing wrong.

Bruce
Re^3: Problems Handling UCS-2LE by ikegami (Patriarch) on May 25, 2009 at 17:24 UTC
Indeed! That shouldn't happen. Or at the very least, it's inconsistent with UTF-16.

$ perl -le'open $fh, "<:encoding(UTF-16le)", \"\xFF\xFE"; print length <$fh>'
1
$ perl -le'open $fh, "<:encoding(UTF-16)", \"\xFF\xFE"; print length <$fh>'
0
$ perl -le'open $fh, "<:encoding(UCS-2le)", \"\xFF\xFE"; print length <$fh>'
1
$ perl -le'open $fh, "<:encoding(UCS-2)", \"\xFF\xFE"; print length <$fh>'
UCS-2BE:Unicode character fffe is illegal at -e line 1.

Using File::BOM or the following would be a better solution than skipping the first two bytes.

$lines =~ s/\x{FEFF}//g;
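Putting the pieces together, a minimal sketch of the substitution approach might look like the following. The helper name `slurp_ucs2le` and the temporary file standing in for the IDrive log are assumptions for illustration only. As a variation on the substitution above, this version anchors the pattern at the start of the string so that only a leading BOM is removed, and it adds `:raw` on the assumption that line endings should pass through untouched on Windows.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: slurp a UCS-2LE file and drop a leading BOM if present.
sub slurp_ucs2le {
    my ($path) = @_;
    open my $fh, '<:raw:encoding(UCS-2LE)', $path
        or die "Cannot open $path: $!";
    local $/ = undef;          # slurp mode: read the whole file at once
    my $text = <$fh>;
    close $fh;
    $text =~ s/^\x{FEFF}//;    # strip only a BOM at the very start
    return $text;
}

# Stand-in for the real log file: a BOM followed by a short message.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':raw:encoding(UCS-2LE)';
print {$out} "\x{FEFF}Backup Completed\r\n";
close $out;

my $lines = slurp_ucs2le($path);
print $lines =~ /Backup Completed/ ? "completed\n" : "FAILED\n";
```

Because the BOM is removed after decoding, the same helper also works on files that have no BOM at all, which is not true of seeking past the first two bytes.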
Re^3: Problems Handling UCS-2LE by Ecurb (Initiate) on May 24, 2009 at 02:18 UTC
BTW, here is the log file I am trying to read (1291 characters). Only IDrive.com log files had this issue; all the other log files work fine.

IDrive UCS2 Log File

Thanks,
Bruce
Re^4: Problems Handling UCS-2LE by ikegami (Patriarch) on May 25, 2009 at 17:38 UTC
Re^5: Problems Handling UCS-2LE by Ecurb (Initiate) on Jun 05, 2009 at 17:01 UTC