You correct somebody for confusing chars with bytes, but then you get the number of chars and/or the number of bytes wrong, and yet your mistake is just 'splitting hairs'?
"Newline is one character but two bytes on Windows" is not in the realm of splitting hairs. There is nothing you can reasonably choose to call "newline" that will give a length() of 1 but a bytes::length() of 2.
You can choose to think of "\n" as a "newline" (or even "the newline character"). You can also (separately) choose to think of "newline" as a concept that varies between platforms (or parts of platforms). But doing those things is very likely, in my experience, to lead you into factual errors that aren't splitting hairs, as indeed happened to you. And you certainly shouldn't combine thinking of "\n" as "the newline character" with thinking of "newline" as something that varies between platforms (because "end of record" for text files can be 1 character, 2 characters, or 0 characters).
Those choices often lead people into thinking (or at least saying) that "\n" is (sometimes) two bytes (or two characters). Yet length("\n") == 1 on all Perl platforms and bytes::length("\n") == 1 on all Perl platforms.
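This is easy to check on whatever Perl you have handy. A minimal sketch (the variable names are just for illustration; bytes is loaded with an empty import list so the pragma itself is not switched on):

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length without enabling byte semantics

my $nl = "\n";

# Number of characters in the Perl string "\n".
my $chars = length($nl);

# Number of bytes in the same Perl string.
my $bytes = bytes::length($nl);

print "chars=$chars bytes=$bytes\n";    # prints chars=1 bytes=1
```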
You can somewhat reasonably declare that, under Windows, "newline" is sometimes a 1-byte, 1-character string (in Perl strings) and is sometimes a 2-byte, 2-character string (in text files). But referring to such a conceptual "newline" via the name "\n" is a huge mistake, IMHO and IME. And saying "newline" is a 2-byte, 1-character thing is just flatly incorrect. Windows "newline" is never a single multi-byte character.
What (normally) gets written to a text file in Windows is bit-for-bit equal to the Perl string "\r\n" (on any ASCII platform, which excludes ancient MacOS which stupidly decided to be "not quite ASCII" for the sake of trying to avoid the need for binmode). So if you define "\n" as "newline that is actually CR followed by LF when in files on Windows", then surely "\r\n" must be CR CR LF. What a confusing mental framework to try to deal with. No wonder you confused yourself to the point of misspeaking.
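Here is a rough sketch of that first claim which should run on any ASCII platform. It pushes an explicit :crlf layer to imitate what a Windows text-mode handle does by default, then reads the file back raw. The temporary file and variable names are my own choices for the example:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file that is removed when the program exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close($tmp_fh) or die "close failed: $!";

# Write one "\n" through a :crlf layer, as a Windows text handle would.
open(my $out, '>:crlf', $path) or die "open for write failed: $!";
print {$out} "line\n";
close($out) or die "close failed: $!";

# Read the file back with no translation at all.
open(my $in, '<:raw', $path) or die "open for read failed: $!";
my $raw = do { local $/; <$in> };
close($in) or die "close failed: $!";

# Compare what landed in the file with the Perl string "line\r\n".
my $same = ($raw eq "line\r\n");
print "file holds ", length($raw), " bytes; ",
      ($same ? "equal" : "not equal"), " to \"line\\r\\n\"\n";
```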
Using "\x0d\x0a" is an improvement only on ancient MacOS and that is only because ancient MacOS did some stupid things in incorrectly defining "\n" as carriage return. In other Perl environments, it is more correct to use "\n" or "\r\n".
I find it hilarious to read (for example, in perlport) about how "\n" is a "logical" thing (can't say "character") and, when talking to a socket, you need to replace it with "\x0a" (or with "\x0d\x0a"). That almost makes sense, but only if you also declare that "H" is a "logical" character and you need to replace it with "\x48" when talking to a socket. The motivation for this mindset is to try to fit ancient MacOS's stupid mistake into a mental framework while mostly ignoring non-ASCII Perl systems.
Many, perhaps even 'most', protocols over sockets assume ASCII. If you are trying to implement SMTP over a socket with Perl and you write "HELO\x0d\x0a", then you have written code that is never going to be correct on a non-ASCII system (other than the near-ASCII ancient MacOS). If you write "HELO\r\n", then you have written code that is likely correct for all (non-ancient-MacOS) Perl systems, including an EBCDIC system that includes an ASCII translation layer in front of sockets.
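To see how the two spellings relate on your own system, something like the following should do. It is only a sketch: the ASCII test via ord("A") is my own shortcut, and on an EBCDIC Perl I would expect the two strings to differ:

```perl
use strict;
use warnings;

# Two ways to spell the end of an SMTP command line.
my $escaped = "HELO\r\n";        # character escapes
my $hex     = "HELO\x0d\x0a";    # hard-coded ASCII code points

# Crude check for an ASCII-based platform.
my $is_ascii = (ord("A") == 65);

# On an ASCII platform "\r" is code point 13 and "\n" is code point 10,
# so both spellings produce the same Perl string there.
my $same = ($escaped eq $hex);

print "ASCII platform: ", ($is_ascii ? "yes" : "no"),
      "; spellings identical: ", ($same ? "yes" : "no"), "\n";
```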
Heck, even VMS Perl ends up with a thin translation layer in front of sockets. That's why CGI.pm knows to use "\n" not "\x0d\x0a" under VMS. CGI.pm also proves my point about EBCDIC systems:
    if ($OS eq 'VMS') {
        $CRLF = "\n";
    } elsif ($EBCDIC) {
        $CRLF= "\r\n";
    } else {
        $CRLF = "\015\012";
    }
A saner version (with identical functionality) of that would actually be:
    if ($OS eq 'VMS') {
        $CRLF = "\n";
    } elsif ($ANCIENT_MACOS) {
        # $CRLF = "\015\012";
        $CRLF = "\n\r"; # Same thing
    } else {
        $CRLF= "\r\n";
    }
And bytes::length("\n") is of course the "byte representation" of newline, not the "text file representation."
So binmode causes your Perl string to end up with the "character representation" of "\n" (which isn't "\n") while w/o binmode you get the "byte representation"? Stop convoluting your mental model so much. Just declare MacOS an outlier and move to a sane mental model where "\n" is just "\n".
But if you don't, then you have to be very, very careful to never refer to "\n" as "a character" (and yet also be very clear that, inside of Perl, "\n" is always just one character). Good luck with that.