desemondo has asked for the wisdom of the Perl Monks concerning the following question:

Greetings,

I am having trouble with getting text (CRLF specifically) to encode correctly into UTF-16 little endian. Essentially I am expecting this output below:

~~~ Human readable output of what is being generated ~~~
Line1
Line2

Line4

~~~ Actual Results ~~~
4C 00 69 00 6E 00 65 00 31 00 0D 0A 00
4C 00 69 00 6E 00 65 00 32 00 0D 0A 00 0D 0A 00
4C 00 69 00 6E 00 65 00 34 00 0D 0A 00

~~~ What was expected and is required for valid UTF-16LE encoding ~~~
4C 00 69 00 6E 00 65 00 31 00 0D 00 0A 00
                                 ^ byte missing from actual results
4C 00 69 00 6E 00 65 00 32 00 0D 00 0A 00 0D 00 0A 00
                                 ^ byte missing from actual results
                                             ^ byte missing from actual results
4C 00 69 00 6E 00 65 00 34 00 0D 00 0A 00
                                 ^ byte missing from actual results
~~~

I suspect this issue (or bug in Encode.pm?) may be due to \n being mapped to CRLF on Windows, whereas on *nix it's just LF, and Encode.pm and its dependencies aren't handling that correctly.
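The suspicion above can be checked in memory, without touching any file. The sketch below is an illustration only: it assumes the stray `0D` comes from a byte-level CRLF translation applied after the text has already been encoded, and it simulates that with a plain substitution. `hex_dump` is a hypothetical helper, not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $text = "Line1\n";

# Translate LF to CRLF on characters first, then encode.
(my $crlf_text = $text) =~ s/\n/\r\n/g;
my $expected = encode('UTF-16LE', $crlf_text);

# Assumed faulty order: encode first, then insert a lone 0D byte
# before every 0A byte, as a byte-level CRLF translation would.
my $translated_late = encode('UTF-16LE', $text);
$translated_late =~ s/\x0A/\x0D\x0A/g;

print hex_dump($expected), "\n";
print hex_dump($translated_late), "\n";
```

The first line printed ends in `0D 00 0A 00`, the second in `0D 0A 00`, which is the same shape as the actual results reported above.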

I have tried numerous things, e.g. using BE, UCS-2LE/BE, and using \015\012 instead of \n; all seem to have the same issue.

  1. Is this a bug, or am I doing something wrong?
  2. Assuming I'm not doing something wrong, is there any way to code around this issue in Perl 5.8.8 (Encode.pm v2.23)?
I'm retesting this on Perl 5.10.1 currently and will update with results. Any assistance or advice would be much appreciated.

Update:
Issue also reproducible on Perl 5.10.1. Am I correct in thinking this is a bug in Encode::Unicode?
Can anyone think of any alternatives to what Anonymous Monk suggested? I appreciate any and all feedback. Thanks

Update2:

Issue resolved. Key points from this experience:

(Code to reproduce is in the Readmore)
use strict;
use warnings;
use Encode qw(encode decode);

### Actual Results
my $string = "Line1\nLine2\n\nLine4\n";
open (my $output_fh, ">:encoding(utf-16le)", 'Test_reg.reg')
    || die "Unable to create reg output file. $!";
print {$output_fh} $string ;

### something else I tried, also doesn't work correctly.
my $string2 = "Line1\015\012Line2\015\012\015\012Line4\015\012";
open (my $output_fh2, ">:encoding(utf-16le)", 'Test_reg2.reg')
    || die "Unable to create reg output file. $!";
print {$output_fh2} $string2 ;

Replies are listed 'Best First'.
Re: CRLF not encoding into UTF-16LE correctly on ActivePerl 5.8.8
by ikegami (Patriarch) on Feb 15, 2010 at 06:41 UTC
Re: CRLF not encoding into UTF-16LE correctly on ActivePerl 5.8.8
by 7stud (Deacon) on Feb 15, 2010 at 05:32 UTC

    Encoding the strings manually and then outputting them with syswrite() works too:

    use strict;
    use warnings;
    use 5.010;
    use Encode qw{encode};

    my $string = "Line1\015\012Line2\015\012\015\012Line4\015\012";
    my $utf16_string = encode('UTF16-LE', $string);

    open my $OUTPUT_FH, '>', 'data1.txt'
        or die "Unable to open data1.txt: $!";
    syswrite $OUTPUT_FH, $utf16_string ;

    --hex output:--
    4C 00 69 00 6E 00 65 00 31 00 0D 00 0A 00
    4C 00 69 00 6E 00 65 00 32 00 0D 00 0A 00 0D 00 0A 00
    4C 00 69 00 6E 00 65 00 34 00 0D 00 0A 00
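A variation on the manual-encoding approach, sketched under a few assumptions: the source text uses plain `\n`, the conversion to CRLF is done on characters before encoding, and `binmode` is used so that no translating layer touches the encoded bytes. The helper name `write_utf16le_crlf` and the file name `data2.txt` are made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: write $text to $path as UTF-16LE with CRLF line endings.
# Returns the number of bytes written.
sub write_utf16le_crlf {
    my ($path, $text) = @_;

    # Convert bare LF to CRLF on characters, leaving existing CRLF pairs alone.
    (my $with_crlf = $text) =~ s/(?<!\r)\n/\r\n/g;
    my $bytes = encode('UTF-16LE', $with_crlf);

    open my $fh, '>', $path or die "Unable to open $path: $!";
    binmode $fh;    # binary mode: no newline translation on output
    print {$fh} $bytes;
    close $fh or die "Unable to close $path: $!";

    return length $bytes;
}

my $written = write_utf16le_crlf('data2.txt', "Line1\nLine2\n\nLine4\n");
print "wrote $written bytes\n";
```

This keeps the string literal free of raw octal escapes while still producing `0D 00 0A 00` for each line ending.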
Re: CRLF not encoding into UTF-16LE correctly on ActivePerl 5.8.8
by Anonymous Monk on Feb 15, 2010 at 03:02 UTC
    crlf is a layer, so you could try :crlf:encoding(UTF-16LE)
      Thanks for the pointer, I wasn't aware of that one. I've found that using perlio in place of crlf allows the 2nd solution (using \015\012 ) to now be possible, though I'd rather not have to use raw octals if it can be avoided...

      When I get the problem, the layers are "unix crlf encoding(UTF-16LE) utf8".
      When I set PERLIO=perlio, the layers are "unix perlio encoding(UTF-16LE) utf8", and 0D is correctly followed by 00 when using raw octals (\015\012).
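The layer lists quoted above show crlf sitting below the encoding layer. A commonly suggested alternative ordering, shown here only as a sketch and not as something confirmed in this thread, is to start from `:raw` and push `:crlf` last, so that newline translation happens on characters before they are encoded. `PerlIO::get_layers` is used to print the resulting stack; the temporary file is an assumption made so the sketch does not overwrite a real `.reg` file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file so the sketch does not overwrite anything real.
my ($tmp_fh, $path) = tempfile(SUFFIX => '.reg', UNLINK => 1);
close $tmp_fh;

# :raw comes first, :crlf is pushed last so it sits above the
# encoding layer and works on characters rather than encoded bytes.
open my $out, '>:raw:encoding(UTF-16LE):crlf', $path
    or die "Unable to open $path: $!";
my @layers = PerlIO::get_layers($out);
print {$out} "Line1\nLine2\n\nLine4\n";
close $out or die "Unable to close $path: $!";

# Read the file back untranslated and render its bytes as hex.
open my $in, '<:raw', $path or die "Unable to reopen $path: $!";
my $bytes = do { local $/; <$in> };
close $in;
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;

print "layers: @layers\n";
print "$hex\n";
```

With this ordering the string can keep using `\n`, and the exact layer list printed may differ between platforms and Perl builds.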