Unicode strangeness

by Odud (Pilgrim)
on Oct 15, 2005 at 21:22 UTC ( #500502=perlquestion )

Odud has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to produce output in ucs2le. What I've got is:

use charnames ':full';
open(PLP, ">:encoding(ucs2le)", "test.plp");
print PLP "PLP PLAYLIST";
print PLP "\N{CARRIAGE RETURN}\N{LINE FEED}";

I expect to see 0d00 0a00 at the end of the output, but instead I'm getting 0d00 0d0a 00, i.e. it looks as though a spurious 0d is getting in there. If I replace LINE FEED with WHITE SMILING FACE (say), then I get the four bytes 0d00 3a26, which is as expected, so it looks as though LINE FEED specifically is causing the problem. I'm running ActiveState 5.8.7 on XP (SP2 etc...)

Replies are listed 'Best First'.
Re: Unicode strangeness
by pg (Canon) on Oct 15, 2005 at 21:50 UTC

    Here is one solution:

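    use strict;
    use warnings;
    use charnames ':full';

    # open in raw mode first, then push the encoding layer
    open(my $fh, ">:raw", "test.plp");
    binmode($fh, ":encoding(ucs2le)");
    print $fh "\r\n";
    close $fh;

    # test: read the raw bytes back (sysread is not allowed on
    # encoded handles in newer perls, so reopen with :raw)
    open(PLP, "<:raw", "test.plp");
    my $string;
    sysread(PLP, $string, 100);
    printf("0x%02x ", ord($_)) for (split //, $string);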

    This prints:

    0x0d 0x00 0x0a 0x00

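    Your original problem is caused by the default :crlf layer and the order in which the layers are stacked. The :crlf layer sits below the encoding layer, so it sees the already-encoded bytes and puts a single 0d byte in front of the 0a byte.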
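
The layer order can be reproduced on any platform by stacking the layers by hand. What follows is a minimal sketch, not pg's code: the helper name `bytes_written` is made up for illustration, and the layer string `:raw:crlf:encoding(UCS-2LE)` is an assumption meant to mimic what a Windows perl ends up with when the encoding is pushed on top of its default `:crlf`.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text through the given layer string and
# return the bytes that reached the file as a hex string.
sub bytes_written {
    my ($layers, $text) = @_;
    my ($tmp, $path) = tempfile(UNLINK => 1);
    close $tmp;

    open(my $out, ">$layers", $path) or die "open for write: $!";
    print {$out} $text;
    close $out or die "close: $!";

    # Read the file back without any translation.
    open(my $in, '<:raw', $path) or die "open for read: $!";
    local $/;
    my $bytes = <$in>;
    close $in;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

# :crlf below the encoding (assumed to match the Windows default stack)
print bytes_written(':raw:crlf:encoding(UCS-2LE)', "\r\n"), "\n";

# no :crlf layer at all
print bytes_written(':raw:encoding(UCS-2LE)', "\r\n"), "\n";
```

The first line printed should be the five-byte sequence from the question (0d 00 0d 0a 00), the second the four bytes that were wanted (0d 00 0a 00).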

      Thanks for that, pg. I've gone with your solution. Thanks also to the other monks for their suggestions. The application that I'm producing the file for expects an exact format/sequence of carriage returns and newlines. The C code that I took the basic ideas from just opened the file in binary mode and then output a null byte after every character to make it look as though it was writing Unicode...
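
For plain ASCII text, the null-byte trick described above and a real UCS-2LE encoding give the same bytes, which is why the C approach appeared to work. A small sketch to illustrate this, assuming an ASCII-only string; the variable names are made up for the example:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "PLP PLAYLIST\r\n";    # ASCII only

# What the C code is described as doing: a null byte after every character.
my $by_hand = join '', map { $_ . "\0" } split //, $text;

# What the UCS-2LE encoding produces for the same string.
my $encoded = encode('UCS-2LE', $text);

print $by_hand eq $encoded ? "identical\n" : "different\n";
```

The two stop agreeing as soon as the text contains a character above U+00FF, where the high byte is no longer zero.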
Re: Unicode strangeness
by graff (Chancellor) on Oct 15, 2005 at 22:14 UTC
    pg is right. I don't have a Windows machine to try it on, but it would seem that when you use the mode spec ">:encoding(ucs2le)" in the open call, this might get appended after the default Windows ":crlf" mode.

    Another thing to try would be one of the following (I'm not sure which because, again, I don't have a Windows box to try it on):

    # either this:
    open( my $fh, ">:encoding(ucs2le):crlf", "filename" );

    # or if that doesn't work, then this:
    open( my $fh, ">:raw:encoding(ucs2le):crlf", "filename" );

    In either case, by putting ":crlf" after the encoding spec, the crlf layer (converting "\n" in your code to "\r\n" on output) will create proper 16-bit renderings of the CR and LF characters (0d 00 0a 00).

    It does seem unfortunate that this is not the default behavior.

    (updated to fix spelling error in code sample)

      Tested on Windows XP. Neither worked. However, the idea is a good one. I think I know where it came from: in the old days, :raw was documented as simply the inverse of :crlf, but that is no longer the case.

      use strict;
      use warnings;
      use charnames ':full';

      open( my $fh, ">:raw:encoding(ucs2le):crlf", "test.plp" );
      print $fh "\N{CARRIAGE RETURN}\N{LINE FEED}";
      close $fh;

      # test
      open(PLP, "<", "test.plp");
      my $string;
      sysread(PLP, $string, 100);
      printf("0x%02x ", ord($_)) for (split //, $string);

      This prints:

      0x0d 0x00 0x0d 0x00 0x0a 0x00

      use strict;
      use warnings;
      use charnames ':full';

      open( my $fh, ">:encoding(ucs2le):crlf", "test.plp" );
      print $fh "\N{CARRIAGE RETURN}\N{LINE FEED}";
      close $fh;

      # test
      open(PLP, "<", "test.plp");
      my $string;
      sysread(PLP, $string, 100);
      printf("0x%02x ", ord($_)) for (split //, $string);

      This prints:

      0x0d 0x00 0x0d 0x0a 0x00
        The whole point of ":crlf" mode is that, when you say "\n" (LINE_FEED) in your code, perl interprets that to mean "newline event", which by definition comes out as "CARRIAGE_RETURN LINE_FEED" (hence the name ":crlf" mode). When you use this mode, you would never explicitly print a "\r" (carriage return) to such a file handle, unless you really want an "extra" carriage return in the output.

        OTOH, you can leave off ":crlf", explicitly print "\r" wherever/whenever you want, and not get them added automatically when you print "\n".

        Since you seem fixated on explicitly printing the carriage returns yourself, and not having them added automatically to every line feed that you print, just leave out ":crlf".

        Based on the tests you've shown, it is essential in any case to make sure the mode begins with ":raw". Without this, the default (actually implicit) ":crlf" mode will somehow be treated in the wrong sequence relative to the ucs2le mode, and the "crlf" sequence does not get converted to a valid sequence of two 16-bit Unicode characters. In terms of the code you're showing:

        ## instead of this:
        open( my $fh, ">:raw:encoding(ucs2le):crlf", "testa.plp" );
        print $fh "\N{CARRIAGE RETURN}\N{LINE FEED}";
        close $fh;

        ## you want either this:
        open( my $fh, ">:raw:encoding(ucs2le)", "testb.plp" );
        print $fh "\N{CARRIAGE RETURN}\N{LINE FEED}";
        close $fh;

        ## or this:
        open( my $fh, ">:raw:encoding(ucs2le):crlf", "testc.plp" );
        print $fh "\N{LINE FEED}";  # :crlf adds CARRIAGE RETURN for you
        close $fh;

        Just for the sake of parsimony and lower probability of screwing things up, I'd prefer the last approach, personally.
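
To check which order the layers actually ended up in, rather than inferring it from the output bytes, the handle can be asked directly. This is a sketch under assumptions: it uses the core `PerlIO::get_layers` function on a temporary file, and the exact layer names printed depend on the platform and perl version, so none are shown here.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Temporary file so the example does not touch test.plp.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

open(my $fh, '>:raw:encoding(UCS-2LE):crlf', $path) or die "open: $!";

# Layers are listed bottom-up: the last name is the one print() hits first.
my @layers = PerlIO::get_layers($fh, output => 1);
print join(' ', @layers), "\n";

close $fh;
```

If "crlf" is listed after the encoding layer, the newline translation happens on characters before they are encoded, which is the arrangement the last variant above relies on.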
Re: Unicode strangeness
by pg (Canon) on Oct 16, 2005 at 00:52 UTC

    There is a way to make your code work without modifying it. Just set environment variable PERLIO to perlio:

    set PERLIO=perlio
