BillKSmith has asked for the wisdom of the Perl Monks concerning the following question:

I am using ActivePerl on Windows 10, and having problems reading a file which I downloaded from an attachment on the "Perl Guru" forum. (The attachment has since been removed.) This is the first time I have ever had to deal with Unicode. The OP specified an encoding(utf-16). The file uses the Windows newline convention of CR/LF. Each of these characters is encoded as a 2-byte character.

The Perl operator <> reads the text correctly, but it returns a "\r\n" at the end of every line instead of "\n". This is a problem because messages can be spread over several lines. They are separated by 'blank' lines. Reading messages by setting $/ to the null string does not work because the 'blank' lines are not blank (they contain only that nasty "\r"). As a workaround, I have been able to set $/="\n\r\n". My question is "How can I make Perl interpret the newline sequence correctly?"

The following code demonstrates the problem by printing the ordinal of the second-last character of the first line. It is 13 (carriage return). The length (43) of the line is two more than the number of printed characters. Sorry, they are hard to count because of the way this forum displays the \cA near the middle of the line.

use strict;
use warnings;

open(my $in, "<:encoding(UTF-16)", "INPUT.TXT" ) || die("Error open INPUT.TXT\n");
my $first_line = <$in>;
my $length_of_line = length $first_line;
my $second_last_character = substr $first_line, -2;
print $first_line;
print $length_of_line, ' ', ord($second_last_character), "\n";
close $in;

OUTPUT:
24.07.2016 18:26:19.171 [>] ☺?;20;0;37;0;
43 13

For reference, here is a hex dump of the first few lines of the file. (reposted with permission)

0000000: fffe 3200 3400 2e00 3000 3700 2e00 3200  ..2.4...0.7...2.
0000010: 3000 3100 3600 2000 3100 3800 3a00 3200  0.1.6. .1.8.:.2.
0000020: 3600 3a00 3100 3900 2e00 3100 3700 3100  6.:.1.9...1.7.1.
0000030: 2000 5b00 3e00 5d00 2000 0100 3f00 3b00   .[.>.]. ...?.;.
0000040: 3200 3000 3b00 3000 3b00 3300 3700 3b00  2.0.;.0.;.3.7.;.
0000050: 3000 3b00 0d00 0a00 0d00 0a00 fffe 3200  0.;...........2.

Bill
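
To make the byte-level picture concrete, here is a small sketch that decodes the twelve bytes at offset 0x50 of the hex dump (just before the second BOM). It assumes little-endian UTF-16, which is what the leading fffe byte order mark indicates; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes copied from offset 0x50 of the hex dump: "0", ";", CR, LF, CR, LF,
# each stored as a 2-byte little-endian code unit.
my $bytes      = pack 'H*', '30003b000d000a000d000a00';
my $byte_count = length $bytes;

# Assumption: UTF-16LE, as implied by the fffe BOM at the start of the file.
my $chars = decode('UTF-16LE', $bytes);

printf "%d bytes decode to %d characters\n", $byte_count, length $chars;
printf "U+%04X\n", ord $_ for split //, $chars;

# The "blank" line between messages is CR LF, not a bare LF.
print "blank line contains a carriage return\n" if $chars =~ /\n\r\n\z/;
```

After decoding, the separator between messages is "\n\r\n" rather than "\n\n", which is why paragraph mode does not see a blank line and why the $/="\n\r\n" workaround happens to work.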

Replies are listed 'Best First'.
Re: Windows newlines in unicode
by haukex (Archbishop) on Sep 17, 2016 at 16:21 UTC

    Hi BillKSmith,

    Try adding the :crlf layer, i.e. open(my $in, "<:encoding(UTF-16):crlf", "INPUT.TXT"). I did a quick test and it works for me; CRLFs are converted to \n and paragraph mode ($/ = '';) works too.

    Hope this helps,
    -- Hauke D
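
For anyone who wants to try the suggestion without the original attachment, the following is a minimal, self-contained sketch. The sample text and the temporary file are made up for illustration (they are not the contents of INPUT.TXT); only the layer string "<:encoding(UTF-16):crlf" and paragraph mode come from the reply above.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Hypothetical sample data (not the original INPUT.TXT): two messages
# separated by a "blank" line, with Windows CR/LF line endings.
my $text = "first message\r\nsecond line\r\n\r\nnext message\r\n";

# Write it as raw bytes: a little-endian BOM followed by UTF-16LE text.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xFF\xFE", encode('UTF-16LE', $text);
close $out or die "Cannot close $path: $!";

# Decode UTF-16 first, then let :crlf turn each CR/LF pair into "\n".
open(my $in, '<:encoding(UTF-16):crlf', $path)
    or die "Cannot open $path: $!";

my @messages;
{
    local $/ = '';                 # paragraph mode: records end at blank lines
    while (my $message = <$in>) {
        chomp $message;            # in paragraph mode, strips all trailing newlines
        push @messages, $message;
    }
}
close $in;

printf "%d messages read\n", scalar @messages;
print "stray carriage return found\n" if grep { /\r/ } @messages;
```

The order of the layers matters here: :crlf is listed after :encoding(UTF-16), so it works on decoded characters rather than on the 2-byte code units.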

      That did it. Thanks a lot. I had not tried that because I had assumed that it was the default under Windows, and therefore not worth the trouble to look up the syntax. Is it always necessary to explicitly specify the crlf layer when using Unicode on Windows?
      Bill

        Hi BillKSmith,

        Glad to help. From open:

        Note that if layers are specified in the three-argument form, then default layers ... are ignored. Those layers will also be ignored if you specifying [sic] a colon with no name following it. In that case the default layer for the operating system (:raw on Unix, :crlf on Windows) is used.

        Hope this helps,
        -- Hauke D
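
One way to check which layers a handle actually ends up with is PerlIO::get_layers. The sketch below opens an empty temporary file (a stand-in, not INPUT.TXT) with three different mode strings and prints the resulting layer stack for each. The exact output depends on the platform and Perl build, so none is shown here.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# An empty temporary file is enough: open() only sets up the layers.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp or die "Cannot close $path: $!";

my %layers_for;
for my $mode ('<', '<:encoding(UTF-16)', '<:encoding(UTF-16):crlf') {
    open(my $fh, $mode, $path) or die "Cannot open $path: $!";
    # PerlIO::get_layers lists the layers on the handle, bottom first.
    $layers_for{$mode} = [ PerlIO::get_layers($fh) ];
    close $fh;
    print "$mode => @{ $layers_for{$mode} }\n";
}
```

Comparing the lines printed on Windows and on Unix should show where a :crlf layer sits relative to the encoding layer in each case.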