"\n" is a newline. One character that's represented by two bytes (CR LF) in windows text files, one byte (LF) in unix text files, and one byte (CR) in macos9 text files. That's all I was saying.

(Also, unix input devices usually convert CR to LF.)

"\r\n" isn't really a thing. It's either CR LF or LF CR. If you really need a CR LF (as in some network protocols), you should probably write it as "\x0d\x0a".

But we're kind of splitting hairs here.

Re^4: different length of a line from linux and windows textfile? (heirs)
by tye (Sage) on Mar 17, 2014 at 22:07 UTC

    You correct somebody for confusion between chars vs bytes but then you get things wrong about number of chars and/or number of bytes involved yet your mistake is just 'splitting hairs'?

    "Newline is one character but two bytes on Windows" is not in the realm of splitting hairs. There is nothing you can reasonably choose to call "newline" that will give a length() of 1 but a bytes::length() of 2.

    You can choose to think of "\n" as a "newline" (or even "the newline character"). You can also (separately) choose to think of "newline" as a concept that varies between platforms (or parts of platforms). But doing those things is very likely, in my experience, to lead you into factual errors that aren't splitting hairs, as indeed happened to you. And you certainly shouldn't combine thinking of "\n" as "the newline character" with thinking of "newline" as something that varies between platforms (because "end of record" for text files can be 1 character, 2 characters, or 0 characters).

    Those choices often lead people into thinking (or at least saying) that "\n" is (sometimes) two bytes (or two characters). Yet length("\n") == 1 on all Perl platforms and bytes::length("\n") == 1 on all Perl platforms.
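For anyone who wants to check this, here is a minimal sketch. The variable names are made up for illustration, and the smiley character is only there as a contrast: it shows what "one character, several bytes" really looks like, and "\n" is not that.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length() without enabling byte semantics

# "\n" as a Perl string: one character, one byte.
my $nl       = "\n";
my $nl_chars = length($nl);
my $nl_bytes = bytes::length($nl);

# Contrast: U+263A is one character that Perl stores internally as 3 bytes.
my $smiley       = "\x{263A}";
my $smiley_chars = length($smiley);
my $smiley_bytes = bytes::length($smiley);

print "newline: chars=$nl_chars bytes=$nl_bytes\n";          # chars=1 bytes=1
print "smiley:  chars=$smiley_chars bytes=$smiley_bytes\n";  # chars=1 bytes=3
```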

    You can somewhat reasonably declare that, under Windows, "newline" is sometimes a 1-byte, 1-character string (in Perl strings) and is sometimes a 2-byte, 2-character string (in text files). But referring to such a conceptual "newline" via the name "\n" is a huge mistake, IMHO and IME. And saying "newline" is a 2-byte, 1-character thing is just flatly incorrect. Windows "newline" is never a single multi-byte character.

    What (normally) gets written to a text file in Windows is bit-for-bit equal to the Perl string "\r\n" (on any ASCII platform, which excludes ancient MacOS which stupidly decided to be "not quite ASCII" for the sake of trying to avoid the need for binmode). So if you define "\n" as "newline that is actually CR followed by LF when in files on Windows", then surely "\r\n" must be CR CR LF. What a confusing mental framework to try to deal with. No wonder you confused yourself to the point of misspeaking.
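A rough way to see this on any ASCII platform, not only Windows, is to ask for the :crlf layer explicitly and then read the file back raw. This is only a sketch: the temp file and variable names are illustrative, and it assumes the PerlIO :crlf layer behaves like the default text-mode translation on Windows.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file; removed automatically when the program exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write one line through the :crlf layer (what Windows text mode does).
open my $out, '>:crlf', $path or die "open for write: $!";
print {$out} "Hello\n";
close $out or die "close: $!";

# Read the file back with no translation at all.
open my $in, '<:raw', $path or die "open for read: $!";
my $raw = do { local $/; <$in> };
close $in;

my $same = $raw eq "Hello\r\n" ? "yes" : "no";
print "file content equals \"Hello\\r\\n\": $same\n";   # yes
print "length of file content: ", length($raw), "\n";    # 7
```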

    Using "\x0d\x0a" is an improvement only on ancient MacOS and that is only because ancient MacOS did some stupid things in incorrectly defining "\n" as carriage return. In other Perl environments, it is more correct to use "\n" or "\r\n".

    I find it hilarious to read (for example, in perlport) about how "\n" is a "logical" thing (can't say "character") and, when talking to a socket, you need to replace it with "\x0a" (or with "\x0d\x0a"). That almost makes sense, but only if you also declare that "H" is also a "logical" character and you need to replace it with "\x48" when talking to a socket. The motivation for this mindset is to try to fit ancient MacOS's stupid mistake into a mental framework while mostly ignoring non-ASCII Perl systems.

    Many, perhaps even 'most', protocols over sockets assume ASCII. If you are trying to implement SMTP over a socket with Perl and you write "HELO\x0d\x0a", then you have written code that is never going to be correct on a non-ASCII system (other than the near-ASCII ancient MacOS). If you write "HELO\r\n", then you have written code that is likely correct for all (non-ancient-MacOS) Perl systems, including an EBCDIC system that includes an ASCII translation layer in front of sockets.

    Heck, even VMS Perl ends up with a thin translation layer in front of sockets. That's why CGI.pm knows to use "\n" not "\x0d\x0a" under VMS. CGI.pm also proves my point about EBCDIC systems:

    if ($OS eq 'VMS') {
        $CRLF = "\n";
    } elsif ($EBCDIC) {
        $CRLF= "\r\n";
    } else {
        $CRLF = "\015\012";
    }

    A saner version (with identical functionality) of that would actually be:

    if ($OS eq 'VMS') {
        $CRLF = "\n";
    } elsif ($ANCIENT_MACOS) {
        # $CRLF = "\015\012";
        $CRLF = "\n\r"; # Same thing
    } else {
        $CRLF= "\r\n";
    }

    And bytes::length("\n") is of course the "byte representation" of newline, not the "text file representation."

    So binmode causes your Perl string to end up with the "character representation" of "\n" (which isn't "\n") while w/o binmode you get the "byte representation"? Stop convoluting your mental model so much. Just declare MacOS an outlier and move to a sane mental model where "\n" is just "\n".

    But if you don't, then you have to be very, very careful to never refer to "\n" as "a character" (and yet also be very clear that, inside of Perl, "\n" is always just one character). Good luck with that.

    - tye        

      "Newline is one character but two bytes on Windows"

      I clearly said "in a Windows text file" in both of my posts. I didn't read past this point in your rant, and I am done with you.

      Incidentally, the only Windows application that still seems to care about CRLFs is Notepad. The Windows/UNIX distinction is slowly dying.

        I clearly said "in a Windows text file" in both of my posts.

        Even in a Windows text file, newline is never simultaneously 1 character and two bytes.

        I didn't read past this point in your rant, and I am done with you.

        But such mistakes aren't a surprise when your reaction to a correction is to minimize your mistake and then to bury your head in the sand. :)

        Confusion on these points is quite common and often leads to actual bugs (like in the head of this thread). Just go look at how many people are insisting that 'chomp strips "\r" characters'.
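The chomp claim is easy to test. A small sketch, assuming the default $/ of "\n" and a line that was read without CRLF translation (for example a Windows text file read on Linux); the variable names are illustrative, and the regex is just one common way to strip either ending:

```perl
use strict;
use warnings;

# A line as it arrives when a CRLF file is read without translation.
my $line = "text\r\n";

# chomp removes only the trailing $/ (by default "\n"); the "\r" stays.
chomp(my $chomped = $line);

# One way to strip either line ending explicitly.
(my $stripped = $line) =~ s/\r?\n\z//;

print "after chomp: length ", length($chomped), "\n";    # 5
print "after regex: length ", length($stripped), "\n";   # 4
```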

        I think perlport does a great job of keeping this confusion high.

        - tye