in reply to Re^5: How do I display only matches
in thread (SOLVED) How do I display only matches

I ran your code on my Windows machine and got the same results. Hooray! This is exactly as expected!!!

I copied the text for "$" verbatim from the Perl regex docs. Normally that is good enough. However, in this case I see that some further "yeah, but" explanation is required!

Problem 1: What "\n" is can be both platform and sometimes context dependent! If I write a "\n" on my Windows machine, that means 2 characters: <CR><LF> (Carriage Return, Line Feed). So when you write "foo\r\n", on Windows that means <CR><CR><LF>. This extra <CR> means that the line doesn't end in "\n", <CR><LF> (Carriage Return, Line Feed). There is indeed something else between the "o" and the line ending and your regex doesn't match - this is correct behavior.

You may not know this (many folks don't), but no matter what the OS platform, when writing to a network socket, "\n" means <CR><LF>. <CR><LF> is the network standard for line endings. So, yes, even on Unix, a write to network socket will be <CR><LF>, while a write to a disk file will be just <LF>. Windows uses the network standard for disk writes - so everywhere on Windows \n means the 2 characters <CR><LF>.

Problem 2: Not every cross-platform case and every platform direction is handled automatically by Perl. If you are on a single platform, then "$" will work as the docs describe, as will chomp(). I have one program that needs to work on: a) old Mac, where "\n" means <CR> in files, b) Windows, where "\n" means <CR><LF> in files, and c) Unix, where "\n" means <LF> in files. When I write code that has to work with all 3 platforms, I use a regex instead of chomp to delete the line endings. s/\s*$//; deletes all whitespace at the end of the line (including line endings like <CR><LF>, which are considered "whitespace").
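A minimal sketch of the regex approach described above, assuming the line endings are present as literal characters in the strings (that is, the data was read without any translating I/O layer). The helper name strip_eol is made up for this illustration. The second substitution uses \R (available since Perl 5.10), which the post does not mention; it removes only the line ending and leaves other trailing whitespace alone.

```perl
use strict;
use warnings;

# Hypothetical helper: remove all trailing whitespace, including any
# mix of <CR> and <LF>, the same way s/\s*$// does in the post.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\s*$//;
    return $line;
}

for my $raw ( "foo\n", "foo\r\n", "foo\r", "foo  \r\n" ) {
    my $all = strip_eol($raw);              # trailing blanks go too
    ( my $eol_only = $raw ) =~ s/\R\z//;    # only the line ending goes
    printf "%-18s -> [%s] [%s]\n",
        join( ' ', map { sprintf '%02X', ord } split //, $raw ),
        $all, $eol_only;
}
```

The difference only shows on the last input: s/\s*$// also eats the two spaces before the line ending, while s/\R\z// keeps them.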

Another thought: I told the OP that there was no need to "chomp" if you are just going to add the line ending back in. That is true as long as you are processing a file and writing a file for the same platform. There are some cases where you'd "chomp" and then print "$_\n" to change the line endings.
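As a rough illustration of stripping the line ending and printing "$_\n" to change it, here is a sketch that converts <CR><LF> text to <LF> text. It uses in-memory filehandles with the :raw layer so that nothing is translated behind the scenes; the variable names and sample data are invented for the example. Because chomp only removes $/ (a single "\n" by default), a regex is used to remove the <CR> as well.

```perl
use strict;
use warnings;

# Sample input with <CR><LF> endings, held in memory for the example.
my $dos_text  = "alpha\r\nbeta\r\n";
my $unix_text = '';

open my $in,  '<:raw', \$dos_text  or die "cannot read buffer: $!";
open my $out, '>:raw', \$unix_text or die "cannot write buffer: $!";

while ( my $line = <$in> ) {
    $line =~ s/\r?\n\z//;      # drop <CR><LF> or a bare <LF>
    print {$out} "$line\n";    # write the line back with <LF> only
}
close $in;
close $out;

# Show the converted bytes as hex.
print unpack( 'H*', $unix_text ), "\n";
```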

I hope this post adds more clarity to the issue. But it probably raises more "yeah, but what if..." questions than it answers. This is all more complex than our OP asked about. I suggest starting a new thread if there is interest in discussing the "dirty details".

Update: To allow for this <CR><CR><LF> situation:

    use warnings;
    use strict;
    use Data::Dump;
    for my $str ( "foo", "foo\n", "foo\r\n" ) {
        dd $str, scalar $str=~/o\s*$/;
    }
    __END__
    ("foo", 1)
    ("foo\n", 1)
    ("foo\r\n", 1)

Re^7: How do I display only matches
by haukex (Archbishop) on Sep 25, 2019 at 21:20 UTC
    If I write a "\n" on my Windows machine, that means 2 characters: <CR><LF> (Carriage Return, Line Feed). So when you write "foo\r\n", on Windows that means <CR><CR><LF>. This extra <CR> means that the line doesn't end in "\n", <CR><LF> (Carriage Return, Line Feed). There is indeed something else between the "o" and the line ending and your regex doesn't match

    I've seen you say something along these lines a couple times before, and I'm sorry, but it's flat out wrong.

    C:\>perl -wMstrict -MDevel::Peek -e "my$x=qq{\n};Dump($x)"
    SV = PV(0x32aa98) at 0x577ef8
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK)
      PV = 0x556ba4 "\n"\0
      CUR = 1
      LEN = 10
      COW_REFCNT = 1

    Note CUR = 1 - there is only one character in that string, not two. I dimly remember that there might have been some builds of Perl for Windows back in the 90's that may have tried to handle it differently, and I remember being confused by this early on as well, but for a long time now, on both Windows and *NIX, \n is LF, and that's what Perl uses internally. The difference is that on Windows the :crlf layer translates that on input and output, but it doesn't change the internal representation, how regexes work, or the default $/ - even on Windows, it's "\n", one byte, "\x0A".
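The same point can be checked without Devel::Peek. This is only a sketch, and it assumes an ASCII-based platform (on EBCDIC systems "\n" is not "\x0A"). It prints the length of "\n", compares it with "\x0A", shows the default $/ in hex, and then runs the anchored match from this thread on both kinds of line ending.

```perl
use strict;
use warnings;

my $nl = "\n";
print 'length("\n")      = ', length($nl), "\n";
print '"\n" eq "\x0A"    = ', ( $nl eq "\x0A" ? 'true' : 'false' ), "\n";
print 'default $/ in hex = ', unpack( 'H*', $/ ), "\n";

# The regex engine sees that same single character, so "$" matches
# before a final "\n" but not before "\r\n".
for my $str ( "foo\n", "foo\r\n" ) {
    printf "%-12s %s\n", unpack( 'H*', $str ),
        ( $str =~ /o$/ ? 'matches /o$/' : 'does not match /o$/' );
}
```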

    I suggest you take the time to read Newlines in perlport and PerlIO.

    I copied the text for "$" verbatim from the Perl regex docs.

    Where in the Perl docs does it say $ matches before \r?

      Let's start with some basics:
      use warnings;
      use strict;
      open (my $file, '>', "testendings") or die "unable to open testendings for write! $!";
      print $file "test\n";
      The file testendings contains, in binary (OK, hex):
      746573740D0A
      0D is CR and 0A is LF.

      Can you replicate this and agree with it on Windows?

        Can you replicate this and agree with it on Windows?

        Yes, of course, as I said, that's because on Windows, the :crlf layer is active by default. The single byte 0A in the string "test\x0A" is being translated by that layer to 0D0A on output, but the internal representation of that string is still just those five bytes, not six ("test\x0D\x0A" or "test\r\n") as you claimed earlier.
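A small sketch of that translation, runnable on any platform because it writes to in-memory buffers with explicitly chosen layers instead of relying on the platform default. The helper name bytes_written is made up for this example. It prints the string as it is in memory, and then the bytes that come out when the same string is printed through :raw and through :crlf.

```perl
use strict;
use warnings;

my $string = "test\n";    # five characters in memory

# Hypothetical helper: print $text through the given layer into an
# in-memory buffer and return the resulting bytes as a hex string.
sub bytes_written {
    my ( $layer, $text ) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer or die "cannot open buffer: $!";
    print {$fh} $text;
    close $fh or die "cannot close buffer: $!";
    return unpack 'H*', $buffer;
}

print 'in memory : ', unpack( 'H*', $string ), "\n";
print 'via :raw  : ', bytes_written( ':raw',  $string ), "\n";
print 'via :crlf : ', bytes_written( ':crlf', $string ), "\n";
```

Only the :crlf line should show the extra 0d; the string itself is never changed.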

        It's all explained fairly well in Newlines in perlport and PerlIO. I suggest you take the time to read and understand that, and test the facts I've already shown for yourself, before we discuss further.

Re^7: How do I display only matches
by haukex (Archbishop) on Sep 25, 2019 at 21:52 UTC
    You may not know this (many folks don't), but no matter what the OS platform, when writing to a network socket, "\n" means <CR><LF>. ... So, yes, even on Unix, a write to network socket will be <CR><LF>, while a write to a disk file will be just <LF>.

    The way you worded this actually annoyed me enough that I double-checked my knowledge on this.

    serv.pl:

    use warnings;
    use strict;
    use IO::Socket::INET;
    use Devel::Peek;
    my $sock = IO::Socket::INET->new(Listen => 5, LocalPort => 9000,
        LocalAddr => 'localhost', Proto => 'tcp') or die $!;
    my $cli = $sock->accept();
    $cli->read(my $in, 5);
    Dump($in);

    cli.pl:

    use warnings;
    use strict;
    use IO::Socket::INET;
    my $sock = IO::Socket::INET->new("localhost:9000") or die $!;
    print $sock "foo\n\0\0";

    Output of serv.pl:

    SV = PV(0x573aeb89cfe0) at 0x573aeb8e2b50
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0x573aeb8ed890 "foo\n\0"\0
      CUR = 5
      LEN = 10

    And, a Wireshark capture of the raw packet on the wire shows:

    66:6f:6f:0a:00:00

    On Linux and on Windows. You're wrong on both of those counts as well. (You can add binmode $sock, ':crlf'; to the client, and it'll be 0d0a on both platforms, as expected.)
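For anyone who wants to try the binmode variant without two scripts and a packet capture, here is a single-process sketch. It is not the test from this post: it assumes a Unix-like system and uses a local AF_UNIX socket pair rather than TCP, and the helper name bytes_on_socket is invented. It sends "foo\n" once with the default layers and once with :crlf pushed onto the writing end, and prints what arrives at the reading end.

```perl
use strict;
use warnings;
use Socket qw(AF_UNIX SOCK_STREAM PF_UNSPEC);

# Hypothetical helper: send $text over a fresh socket pair, optionally
# with an extra layer on the writing end, and return the received
# bytes as a hex string.
sub bytes_on_socket {
    my ( $text, $layer ) = @_;
    socketpair( my $writer, my $reader, AF_UNIX, SOCK_STREAM, PF_UNSPEC )
        or die "socketpair failed: $!";
    binmode $reader;                              # read the bytes untranslated
    binmode $writer, $layer if defined $layer;    # e.g. ':crlf'
    print {$writer} $text;
    close $writer or die "close failed: $!";      # flushes and signals EOF
    local $/;                                     # slurp until EOF
    my $received = <$reader>;
    close $reader;
    return unpack 'H*', $received;
}

print 'default layers: ', bytes_on_socket("foo\n"), "\n";
print 'with :crlf    : ', bytes_on_socket( "foo\n", ':crlf' ), "\n";
```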

    I hope this post adds more clarity to the issue.

    No, it spreads false and confusing information.

      To hopefully add a bit more clarity (or at least improve my own understanding), POSIX has a binary model of files and "\n" is always <LF> in user programs. But sometimes the kernel can translate <LF> to <CR><LF> and this usually means that a terminal driver is involved somewhere — this is frequently seen when using Expect, which uses the pty facilities, which emulate a terminal and therefore involve the kernel terminal driver.

      As far as I know, sockets are always binary and no such translation ever occurs, so I am unsure where this misinformation about network line endings originated. Perhaps STREAMS had such a translation module?

        I don't know enough about the details of the kernel's behavior to say for sure, but what you write sounds correct to me. I have my guesses about where the misinformation originated, but of course I can't say what another person is thinking.

      I can see that you are annoyed. I don't want to annoy you or anybody!
      Let's relax and get to the facts where we agree...

      We both agree that "\n" prints something different under Unix than on Windows.

      I further claim that all network communication uses <CR><LF> as the transmitted line ending.
      You do not believe that.

      I need to find a Unix machine to test upon.
      I am curious what these nulls mean? "foo\n\0"\0

        Accurate wording is important.

        We both agree that "\n" prints something different under Unix than on Windows.

        No. What gets written to a handle by a print "\n" can be the same on both platforms, or different, depending on which I/O layers are in effect.

        I further claim that all network communication uses <CR><LF> as the transmitted line ending. You do not believe that.

        No, I didn't say that either. You claimed that "no matter what the OS platform, when writing to a network socket, "\n" means <CR><LF>", which I proved to be incorrect. This is completely separate from the fact that many network protocols do indeed use CRLF as their standard line ending (Update: and certainly not "all" network protocols use CRLF).

        I am curious what these nulls mean? "foo\n\0"\0

        I added the \0s for two reasons: first, the server is hard-coded to expect five bytes, and I wanted to make sure that the client always sends at least that many bytes; second, I wanted it to be clear that the server isn't reading too few bytes and cutting off something relevant. The \0 after the quote is AFAIK Perl's way of saying the string is null-terminated (ASCIIZ).

        I believe that the confusion here is that the standards specify use of <CR><LF> (as far back as RFC139, page 3, on a quick check) but *nix has a far simpler I/O model than was contemplated for the systems on the early ARPAnet. (ASCII itself is RFC20, by the way.)

        So for correct and portable network usage, you are supposed to use "\015\012" rather than "\n" anyway, although Perl's ":crlf" PerlIO layer should cause "\n" to be emitted as "\015\012". Confused enough yet?
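A short sketch of spelling the line ending out explicitly, as suggested above. The request text is made-up sample data, and the variable names are only for illustration. The point is that "\015\012" is always exactly these two bytes, whatever the platform; the handle it is written to should then have no :crlf layer (for example after a plain binmode), otherwise the <LF> would be translated a second time.

```perl
use strict;
use warnings;

my $CRLF = "\015\012";    # <CR><LF>, independent of platform and layers

# A made-up HTTP-style request used only as sample data. The two empty
# strings at the end produce the blank line that closes the header block.
my $request = join $CRLF,
    'GET / HTTP/1.1',
    'Host: example.com',
    '', '';

print length($CRLF), " bytes per line ending\n";
print unpack( 'H*', $CRLF ), "\n";
print length($request), " bytes in the request\n";
```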