PerlMonks  

Re^6: Dealing with files with differing line endings

by Marshall (Canon)
on Nov 15, 2021 at 20:33 UTC ( [id://11138850]=note )


in reply to Re^5: Dealing with files with differing line endings
in thread Dealing with files with differing line endings

I experimented more with this (code attached below). Answering my own question - when reading a line using "<:raw", Perl is looking for a <LF> to determine the "end of line". This is the same thing that it does with the :crlf layer. The difference is that with :raw, the <CR> (if any) immediately before the <LF> is not removed.

The terminology does get confusing because "\n" as written in Perl on Windows sometimes means <CR><LF> and sometimes it means only <LF>.

With the normal I/O layer, the <CR> in <CR><LF> will be removed before your Perl code ever sees the line. chomp() only operates on <LF>, not <CR><LF>.

Running two regexes as you suggest is not necessary. The standard I/O layer already does this part: $text=~s/\r\n/\n/g; (remove any <CR> that immediately precedes a <LF>). Translating any remaining <CR> to <LF> would then get the multiple lines contained within the input string into "normal line format".

So the rub here is that there is no easy way to say "give me a line" regardless of whether the file came from an old Mac, Unix or Windows. $/, the input record separator, is a string, not a regex. When you attempt to read a line from a file with <CR> terminated lines, you will get the entire file, not just one line, because readline is looking for <LF>. Having in effect slurped the entire file into one string variable, you can indeed split it up into "real lines". However, we have now altered the program flow from reading a line at a time to reading the whole file into a buffer, modifying that buffer (perhaps with tr instead of a regex) and then reading that buffer a line at a time.
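The "slurp, then split into real lines" flow described above could look roughly like this. It is only a sketch: split_any_eol is a made-up helper name, and it assumes the whole input already sits in one string.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into lines whether they end in
# <CR><LF>, a lone <LF>, or a lone <CR>.
sub split_any_eol {
    my ($text) = @_;
    # <CR><LF> is tried first so that it counts as one line ending, not two.
    return split /\r\n|\n|\r/, $text;
}

my @lines = split_any_eol("unix\nwindows\r\nold mac\rlast");
print scalar(@lines), " lines: ", join('|', @lines), "\n";
```

Note that split drops trailing empty fields, so a final line terminator does not produce an extra empty "line".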

Anyway, I did not see the need to burden the code for the 99.99999% case with special stuff for this ancient Mac. There are also some memory issues with reading entire files into memory to process them when line by line processing is desired. It would also be possible to read part of the file, determine that \r should be the input record separator, then back up and use that. But that is "complicated".
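For what it's worth, the "read part of the file, decide on the separator, then back up" idea might be sketched as below. guess_eol and the 4096-byte sample size are assumptions made up for illustration, the handle must be seekable, and a sample that happens to end between a <CR> and its <LF>, or a file with mixed endings, will fool it.

```perl
use strict;
use warnings;

# Hypothetical helper: peek at the start of a seekable handle, guess the
# line ending, then rewind so the caller can read from where it was.
sub guess_eol {
    my ($fh, $peek_bytes) = @_;
    $peek_bytes ||= 4096;                  # assumed sample size
    my $pos = tell($fh);
    read($fh, my $buf, $peek_bytes);
    seek($fh, $pos, 0) or die "seek failed: $!";
    $buf = '' unless defined $buf;
    return "\r\n" if $buf =~ /\r\n/;
    return "\n"   if $buf =~ /\n/;
    return "\r"   if $buf =~ /\r/;
    return "\n";                           # nothing seen, fall back to <LF>
}

my $data = "one\rtwo\rthree\r";            # old Mac style, <CR> only
open(my $fh, '<:raw', \$data) or die "open failed: $!";
{
    local $/ = guess_eol($fh);             # readline and chomp both follow $/
    while (my $line = <$fh>) {
        chomp $line;
        print "[$line]\n";
    }
}
close $fh;
```

This keeps the line-at-a-time program flow, at the cost of one extra read and a seek per file.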

I'm not working with Unix at the moment. But from memory, Perl code to read files line by line between Unix and Windows is the same. When reading a Windows file on Unix, the I/O layer zaps the <CR> and I never see it. When Windows reads a Unix file, it doesn't care that the <CR> isn't there. When writing a line on Unix, Perl writes a <LF> for "\n". When writing a line on Windows, Perl writes a <CR><LF> for "\n".
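The writing half of that can be checked on any platform by naming the layers explicitly instead of relying on platform defaults. A small sketch, where bytes_written is a hypothetical helper and :crlf stands in for what a Windows perl would apply on output:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text through $layers, then return the
# bytes that actually landed in the file as a hex string.
sub bytes_written {
    my ($layers, $text) = @_;
    my ($tmp, $path) = tempfile(UNLINK => 1);
    close $tmp;
    open(my $out, ">$layers", $path) or die "open failed: $!";
    print {$out} $text;
    close $out or die "close failed: $!";
    open(my $in, '<:raw', $path) or die "open failed: $!";
    local $/;                              # slurp mode
    my $bytes = <$in>;
    close $in;
    return unpack 'H*', $bytes;
}

print bytes_written(':raw',  "ab\n"), "\n";   # <LF> written as-is
print bytes_written(':crlf', "ab\n"), "\n";   # <LF> expanded to <CR><LF>
```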

Mixed line ending files can happen. When I was working on Unix, my environment allowed me to click on a remote Unix file and edit it with my local Windows editor. Only the lines that I modified wound up with <CR><LF> endings. My editor preserved the existing <LF> terminated lines. Perl and GNU C didn't have an issue with this and I didn't really worry about it. LPR was fussy. I had some simple Perl thing that read a line, chomped it, then printed the line with "\n" (which on output is platform specific). Now that I think about it, it could be that chomp() was unnecessary: the read of the <CR><LF> line would have zapped the <CR> already. There would be no need to remove the <LF> only to add it back in.

Unix and Windows have <LF> in common and that works well. Ancient Mac with <CR> is a "weird duck".

use strict;
use warnings;

### setup input file <CR>=0d <LF>=0a
open (my $fh,'>',"testfilein.txt") or die "$!";
print $fh "bbb \naaa \r\n";   #62626220 0d0a 61616120 0d0d0a
close $fh;                    #note spaces are for human reading

open ($fh,'>',"testfileout.txt") or die "$!";
binmode $fh;

#read from input with std <>, write binary to output file
open (my $fh2, '<', "testfilein.txt") or die "$!";
while (my $line = <$fh2>)
{
    print length($line),'_', $line, '|';
    print $fh $line;   #62626220 0a 61616120 0d0a
}
close $fh2;
close $fh;

print "*** run two...\n";
# use same read file
# this time use :raw layer for reading
# A line ends in <LF> like above, but the <CR> before it
# (if any) is not removed.
open ($fh,'>',"testfileout2.txt") or die "$!";
binmode $fh;
open ($fh2, '<:raw', "testfilein.txt") or die "$!";
while (my $line = <$fh2>)
{
    print length($line),'_', $line, '|';
    print $fh $line;   #62626220 0d0a 61616120 0d0d0a
}
__END__
5_bbb             "bbb "+LF     4+1=5
|6_aaa            "aaa "+CRLF   4+2=6
|*** run two...
6_bbb             "bbb "+CRLF   4+2=6
|7_aaa            "aaa "+CRCRLF 4+3=7

Replies are listed 'Best First'.
Re^7: Dealing with files with differing line endings
by haukex (Archbishop) on Nov 15, 2021 at 21:21 UTC
    when reading a line using "<:raw", Perl is looking for a LF to determine the "end of line"

    It's looking for $/.

    chomp() only operates on LF, not CRLF.

    chomp operates on whatever $/ is set to, including if that's set to CRLF for whatever unusual reason.
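A minimal sketch of that behaviour, for anyone who wants to check it (chomp_with is a hypothetical helper, not something from the thread):

```perl
use strict;
use warnings;

# Hypothetical helper: chomp a copy of $line with $/ temporarily set to $sep.
sub chomp_with {
    my ($line, $sep) = @_;
    local $/ = $sep;
    chomp $line;
    return $line;
}

my $line = "aaa \r\n";
printf "\$/ = LF:   %d chars left\n", length(chomp_with($line, "\n"));    # <CR> stays
printf "\$/ = CRLF: %d chars left\n", length(chomp_with($line, "\r\n"));  # <CR> and <LF> both go
```

With $/ at its default only the <LF> is removed, so five characters remain; with $/ set to CRLF both go and four remain.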

    When reading a Windows file on Unix, the I/O layer zaps the CR and I never see it.

    Only if you explicitly specify the :crlf layer, which your code doesn't do.
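One way to see the difference without depending on platform defaults is to name the layer in the open call. This is a sketch only; first_line_length is a hypothetical helper and the temporary file stands in for a Windows-style input file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small Windows-style file byte for byte (no layer translation).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "aaa \r\nbbb \r\n";
close $out or die "close failed: $!";

# Hypothetical helper: length of the first line as seen through $layers.
sub first_line_length {
    my ($layers, $file) = @_;
    open(my $in, "<$layers", $file) or die "open failed: $!";
    local $/ = "\n";
    my $line = <$in>;
    close $in;
    return length $line;
}

printf "%-6s %d\n", $_, first_line_length($_, $path) for ':raw', ':crlf';
```

Through :raw the first line keeps its <CR> (6 characters); through :crlf the <CR><LF> pair arrives as a single "\n" (5 characters).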

    The terminology does get confusing because "\n" as written in Perl on Windows sometimes means CRLF and sometimes it means only LF.

    For about the millionth time: No. Maybe it's finally time to read Newlines in perlport?

      Only if you explicitly specify the :crlf layer, which your code doesn't do.

      Strawberry perl.exe adds the :crlf layer unless you tell it otherwise.

      C:\usr\local\share>perl -MConfig -MPerlIO -le "print for PerlIO::get_layers(STDIN), '-'x10, $Config{myuname}"
      unix
      crlf
      ----------
      Win32 strawberry-perl 5.30.0.1 #1 Thu May 23 12:20:46 2019 x64

      (I haven't used Active State since the early aughts, so I cannot tell you how the other major Windows port of perl behaves, though my vague recollections were that I had never heard of IO layers back then, but that newlines just worked right, as they do with modern Strawberry, so I am assuming they also set :crlf for you.)


      update: hmm, you even knew it was on by default on Windows in Re^9: How do I display only matches (from the other conversation you alluded to), so I have to assume I've missed something in the context of this thread. I don't see anything in the posted code that would override that (other than the :raw open, of course)... so I'm more confused than when I first posted this. :-( Maybe I've had too hard of a day, and I should stop trying at this point. Time to go home! :-)
        I have to assume I've missed something in the context of this thread.

        This bit in the text I quoted:

        When reading a Windows file on Unix

        It happens :-) And though it's just a guess, I'd be surprised if ActivePerl doesn't add the :crlf layer by default. I think Cygwin Perl doesn't.
