Ecurb has asked for the wisdom of the Perl Monks concerning the following question:

I'm a novice at Perl.

I ran into a problem reading an idrive.com log file on a Windows XP computer.

My solution looks like:
my $logopen = open LOGFILE, "<:encoding(UCS-2LE)", $file;
seek LOGFILE, 2, 0;
local $/ = undef;
my $lines = <LOGFILE>;
print '<li>';
if ( $lines =~ /Backup Completed/ ) {
    print "IDrive Backup Completed: ";
}
else {
    print "<b>Backup FAILED:</b> ";
}
print $file . "</li>\n";

According to Notepad++, the IDrive log file was UCS-2 LE. Perl would not slurp the file into a string until I explicitly used ":encoding(UCS-2LE)" and skipped past the first two bytes of the file. If I did not skip the first two bytes, I would get a "wide character" warning and "$lines = <LOGFILE>" would only capture a few characters out of a 1088 character file.

I'm still wondering:
1. If Perl should have handled the UCS-2LE file without needing to include the encoding or the skipping of bytes
2. If the IDrive log files might be a non-standard or corrupted UCS-2LE

Bruce

Replies are listed 'Best First'.
Re: Problems Handling UCS-2LE
by ikegami (Patriarch) on May 21, 2009 at 18:35 UTC

    The bytes you are skipping form character U+FEFF, the Byte-order mark. Use UCS-2 instead of UCS-2le and it will skip the character for you.
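
    To make the point about the two skipped bytes concrete, here is a minimal sketch that decodes a short byte string by hand. The sample bytes are made up for illustration (a little-endian BOM followed by the letter "A"); they are not taken from the actual IDrive log.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Illustrative sample only: FF FE (the BOM in little-endian order)
# followed by 41 00 (the letter "A" in UCS-2LE).
my $bytes = "\xFF\xFE\x41\x00";

# Decode the raw bytes into Perl characters.
my $chars = decode('UCS-2LE', $bytes);

# List each decoded character as a code point.
my @code_points = map { sprintf 'U+%04X', ord } split //, $chars;
print "@code_points\n";
```

    With an explicit little-endian encoding the BOM is not consumed; it comes through as an ordinary U+FEFF character at the start of the string.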

    The wide character warning is issued because you are outputting decoded characters without encoding them (loosely speaking). Fix:

    # Encode output.
    # Use the encoding that's appropriate for you.
    binmode STDOUT, ':encoding(UTF-8)';

    my $lines;
    {
        # Decode input.
        open my $log_fh, "<:encoding(UCS-2)", $file
            or die($!);
        local $/ = undef;
        # Read from the lexical handle opened above, not LOGFILE.
        $lines = <$log_fh>;
    }

    print "...\n", $lines, "...\n";

    On Unix, the pragma  use open ':std', ':locale';  sets the "correct" encoding for STDOUT, but it doesn't work on Windows :(

    If I did not skip the first two bytes, [...] "$lines = <LOGFILE>" would only capture a few characters out of a 1088 character file.

    You are mistaken.

    If Perl should have handled the UCS-2LE file without needing to include the encoding or the skipping of bytes

    Perl has no way of knowing the encoding of a file, or even if it's a text file for that matter.
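
    Since Perl will not guess, a script that wants to can peek at the first few raw bytes and look for a BOM itself. The following is only a sketch: guess_bom_encoding is a hypothetical helper name, it recognizes just three BOMs, and it cannot tell UTF-16LE from UTF-32LE (whose BOM also starts with FF FE). A file without a BOM still leaves the encoding unknown.

```perl
use strict;
use warnings;

# Hypothetical helper: guess an encoding name from a leading BOM.
# $source is a file name (or a scalar ref for an in-memory file).
# Returns undef when no recognized BOM is present.
sub guess_bom_encoding {
    my ($source) = @_;
    open my $fh, '<:raw', $source or die "Cannot open input: $!";
    read $fh, my $head, 3;      # raw bytes, no decoding
    close $fh;
    $head = '' unless defined $head;
    return 'UTF-8'    if $head =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE' if $head =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $head =~ /\A\xFE\xFF/;
    return undef;
}

# Illustrative in-memory sample: little-endian BOM plus the letter "A".
my $sample = "\xFF\xFEA\x00";
print guess_bom_encoding(\$sample) // 'unknown', "\n";
```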

    If the IDrive log files might be a non-standard or corrupted UCS-2LE

    Why do you ask that?

    There are some byte combinations that aren't allowed in UCS-2*. Encountering them is fatal.

    $ perl -e'open $fh, "<:encoding(UCS-2le)", \"\x00\xD8"; <$fh>'
    UCS-2LE:no surrogates allowed d800 at -e line 1.

      I tried:
      my $logopen = open LOGFILE, "<:encoding(UCS-2)", $file;
      local $/ = undef;
      my $lines = <LOGFILE>;
      The last line generates the error message: "UCS-2BE:Unicode character fffe is illegal"
      Thanks for any other thoughts on what I might be doing wrong.
      Bruce

        Indeed! That shouldn't happen. Or at the very least, it's inconsistent with UTF-16.

        $ perl -le'open $fh, "<:encoding(UTF-16le)", \"\xFF\xFE"; print length <$fh>'
        1
        $ perl -le'open $fh, "<:encoding(UTF-16)", \"\xFF\xFE"; print length <$fh>'
        0
        $ perl -le'open $fh, "<:encoding(UCS-2le)", \"\xFF\xFE"; print length <$fh>'
        1
        $ perl -le'open $fh, "<:encoding(UCS-2)", \"\xFF\xFE"; print length <$fh>'
        UCS-2BE:Unicode character fffe is illegal at -e line 1.

        Using File::BOM or the following would be a better solution than skipping the first two bytes.

        $lines =~ s/\x{FEFF}//g;
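
        Putting the pieces of this thread together, one possible shape for the whole read is sketched below. slurp_ucs2le is a hypothetical helper name, and the sample data is made up. Unlike the substitution above, this variant anchors the pattern with \A so that only a BOM at the very start of the text is removed; that is a design choice, not something the thread settled on.

```perl
use strict;
use warnings;

# Hypothetical helper: slurp a UCS-2LE file into a string of
# characters and drop a leading BOM if one is present.
# $source is a file name (or a scalar ref for an in-memory file).
sub slurp_ucs2le {
    my ($source) = @_;
    open my $fh, '<:encoding(UCS-2LE)', $source
        or die "Cannot open log: $!";
    local $/ = undef;           # slurp mode
    my $text = <$fh>;
    close $fh;
    $text = '' unless defined $text;
    $text =~ s/\A\x{FEFF}//;    # strip only a leading BOM
    return $text;
}

# Illustrative in-memory sample: BOM followed by "OK" in UCS-2LE.
my $sample = "\xFF\xFEO\x00K\x00";
my $text   = slurp_ucs2le(\$sample);
print length($text), "\n";
```

        Because the BOM is removed after decoding rather than by seeking past two raw bytes, the same helper also works on a file that happens to have no BOM.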
        BTW, here is the log file I am trying to read (1291 characters). Only IDrive.com log files had this issue, all the other log files work fine.

        IDrive UCS2 Log File

        Thanks
        Bruce