greenhorn has asked for the wisdom of the Perl Monks concerning the following question:

Someone at work wrote to an in-house Perl discussion alias:
Does anyone know a clean way to test for
DOS v. UNIX EOL in a text file (using Perl) ? It seems
that
 (chop($UnixLine) == chop($DOSLine)) returns true =-(

No one replied. I thought: hey, I might be a mere newbie, but I'll bet I can figure this out.

Wrong. :) I made some small test files with both <CR><LF> line endings and <LF>-only line endings. Then I watched the results of extracting only those characters from each line in the Perl Builder watch-window. It appeared as if the same characters were returned each time (never mind that one line-ending was \x0D\x0A and the other was \x0A alone).

It struck me that using == there wasn't right; should he not be using "eq"? Had he in fact
compared 0 with 0? (0 == 0 does have a certain ring of truth to it. :)
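
The suspicion about == can be checked directly. The following is a minimal sketch, not the co-worker's code: the variable names are made up for illustration, and it only assumes that neither line-ending character looks like a number to Perl, so both become 0 in a numeric comparison.

```perl
use strict;
use warnings;

my $lf = "\x0A";   # line feed
my $cr = "\x0D";   # carriage return

# Numeric comparison: neither string is a number, so both numify to 0
# and == reports them as equal.
my $numeric_equal = do {
    no warnings 'numeric';   # silence the "isn't numeric" warnings for this demo
    ($lf == $cr) ? 1 : 0;
};

# String comparison: the characters differ, so eq reports them as different.
my $string_equal = ($lf eq $cr) ? 1 : 0;

print "== says equal: $numeric_equal\n";   # prints 1
print "eq says equal: $string_equal\n";    # prints 0
```

Running the original test under -w would have printed "isn't numeric" warnings, which is usually the first hint that == is being applied to strings.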

Does perl for Win32 "internally" convert Unix newlines to <CR><LF>?

Replies are listed 'Best First'.
RE: Unix \n vs. DOS \n
by Abigail (Deacon) on Jul 15, 2000 at 15:12 UTC
    Somewhere, Tom has a long write-up about this. But the basics are that DOS stores CR LF only for text files, and only when written on a physical device. As soon as you read it in, the C library turns the physical line ending of CR LF into the logical newline \n. And when you write it to a file, the reverse happens. That is, if you run the program under DOS.

    If you take your DOS file to a Unix platform, only the LF gets mapped to the logical newline \n (which happens to be represented with a LF character as well). The preceding CR byte is considered by Unix to be just another byte. Also note that chop chops off the last character of a string. One character, nothing more. So, if you are on Unix, reading a line from either a Unix file or a DOS file, the last character will be LF, aka \x0A.

    So, yes, the comparison should have been done with eq instead of ==, but that still doesn't make a difference: "\x0A" eq "\x0A".

    There is no flawless way to determine whether something is a "Unix line" or a "DOS line". "Unix line"s end with a LF character, and "DOS line"s with CR LF. However, there is nothing that forbids a "Unix line" to have a CR character just before the LF character.
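
With that caveat in mind, a heuristic check is still possible if the file is read in binary mode so that no CR LF translation takes place. The sketch below is an illustration under stated assumptions, not a definitive test: line_ending is a hypothetical helper, the scratch file comes from the core File::Temp module, and a "Unix line" that happens to end in CR will be reported as CRLF.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: report how a raw (untranslated) line ends.
sub line_ending {
    my ($line) = @_;
    return 'CRLF' if $line =~ /\x0D\x0A\z/;
    return 'LF'   if $line =~ /\x0A\z/;
    return 'none';
}

# Write a two-line sample in binary mode so the bytes are stored exactly as given.
my ($out, $path) = tempfile();
binmode($out);
print {$out} "dos line\x0D\x0A", "unix line\x0A";
close($out) or die "Cannot close $path: $!";

# Read it back in binary mode; without binmode, Perl on DOS/Windows
# would turn CR LF into a bare "\n" before the script ever sees it.
open(my $in, '<', $path) or die "Cannot open $path: $!";
binmode($in);
my @kinds;
while (my $line = <$in>) {
    push @kinds, line_ending($line);
}
close($in);
unlink $path;

print "$_\n" for @kinds;   # prints CRLF, then LF
```

Lines are still split on LF here, so the only question asked of each line is whether a CR sits immediately before it.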

    -- Abigail

(Ovid) RE: Unix \n vs. DOS \n
by Ovid (Cardinal) on Jul 15, 2000 at 21:00 UTC
    As a side note, don't use chop to get rid of newlines. I see this all the time in programs and it makes me cringe. You want to use chomp.

    chomp will only remove the last character if it's a newline. Consider the following "harmless" code:

    #!/usr/bin/perl -w
    use strict;

    while (<DATA>) {
        chop;
        print "$_\n";
    }
    __DATA__
    this is a test
    this is another
    You can't see it in the above code, but I deliberately did not hit "Enter" after the last line. I even hit backspace a few times to ensure that there was nothing after the word "another". The result?
    this is a test
    this is anothe
    chop happily removed the "r" in another. chomp was designed for situations like that and should be used where appropriate.
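
The difference is easy to see side by side. This is a small sketch with made-up sample strings, applying both functions to a line that ends in a newline and to one that does not:

```perl
use strict;
use warnings;

my @samples = ("with newline\n", "without newline");
my (@chopped, @chomped);

for my $s (@samples) {
    my $c1 = $s;
    chop $c1;      # always removes the last character, whatever it is
    my $c2 = $s;
    chomp $c2;     # removes the trailing newline only if one is present
    push @chopped, $c1;
    push @chomped, $c2;
    print "chop: [$c1]  chomp: [$c2]\n";
}
# prints:
# chop: [with newline]  chomp: [with newline]
# chop: [without newlin]  chomp: [without newline]
```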

    Cheers,
    Ovid

Re: Unix \n vs. DOS \n
by vkonovalov (Monk) on Jul 15, 2000 at 14:34 UTC
    You probably forgot to use the built-in "binmode" function, which controls whether a filehandle is treated as text mode or binary mode.

    Otherwise, if you're inside a perl script, then perl uses UNIX-like line endings, for example in here-document strings and inside any other strings:

    $a=<<"EOS";
    abcd
    efgh
    EOS
    and
    $a="abcd
    efgh
    ";
    and
    $a="abcd\nefgh\n";
    are the same.
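
The claim can be checked with eq. This is a minimal sketch with hypothetical variable names; it assumes the script file itself is saved with the platform's native line endings, since a script with CR LF endings run on Unix would carry stray CR characters into the first two strings.

```perl
use strict;
use warnings;

# Here-document with literal line breaks.
my $heredoc = <<"EOS";
abcd
efgh
EOS

# Double-quoted string with literal line breaks.
my $literal = "abcd
efgh
";

# Double-quoted string with \n escapes.
my $escaped = "abcd\nefgh\n";

my $all_same = ($heredoc eq $literal && $literal eq $escaped) ? 1 : 0;
print "all three strings identical: $all_same\n";
```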
RE: Unix \n vs. DOS \n
by BigJoe (Curate) on Jul 15, 2000 at 16:34 UTC
    To answer your question

    Does perl for Win32 "internally" convert Unix newlines to CR-LF?

    NO. I usually use a regular expression to convert all \n characters to \r\n, like this:
    $mystring =~ s/\n/\r\n/g;
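
One thing to watch for: if the string already contains some CR LF pairs, replacing every \n adds a second CR to those lines. The variant below is a hedged sketch with invented sample data; it uses explicit \x0D and \x0A and an optional CR in the pattern so that mixed input converts cleanly in either direction.

```perl
use strict;
use warnings;

# Sample with mixed endings: one CR LF line, two LF-only lines.
my $mixed = "one\x0D\x0Atwo\x0Athree\x0A";

# To DOS: match an optional CR before each LF so existing CR LF pairs are not doubled.
(my $dos = $mixed) =~ s/\x0D?\x0A/\x0D\x0A/g;

# To Unix: drop the CR from every CR LF pair.
(my $unix = $mixed) =~ s/\x0D\x0A/\x0A/g;

print length($mixed), "\n";   # prints 15
print length($dos),   "\n";   # prints 17
print length($unix),  "\n";   # prints 14
```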


    Hope this helps.

    --BigJoe
Re: Unix \n vs. DOS \n
by greenhorn (Sexton) on Jul 16, 2000 at 15:20 UTC
    I believe the fellow at work used chop because he wanted to have Perl return the line-ending character; chomp seems to return only a result code and not the "chomped" character.

    I created a four-line text file in which two lines had CR/LF line endings and the other two had LF-only line endings. Then, a small script that reads each line of the file. Following is the business end of it. (All lines in the file have "F" immediately before the line boundary.)

    # TWO LINES IN THE FILE MEET THE FOLLOWING CRITERIA:
    print "ends CRLF\n" if /F\x0D\x0A$/;
    print "ends CRLF\n" if /F\r\n/;
    print "contains CR\n" if /\x0D/;
    print "contains CR\n" if /\r/;

    # AND THE OTHER TWO LINES IN THE FILE MATCH THIS:
    print "ends LF only\n" if /F\x0A$/;

    But the script printed only this: ends LF only. It never did print ends CRLF or contains CR.

    If perl doesn't make some internal translation of the carriage-return characters when it's reading a file, then why that result? Are the tests above not sufficient?

      chomp returns the number of characters removed. It removes whatever's in $/, so he can just check that.

      Update: Yes, $/, as jlp pointed out.
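
A short sketch of that idea, with a made-up sample line: chomp's return value depends on what $/ holds at the time, so setting $/ to CR LF inside a block makes chomp strip (and count) both characters.

```perl
use strict;
use warnings;

my $line = "text\x0D\x0A";

# Default $/ is "\n": chomp removes only the LF and leaves the CR behind.
my $default_copy    = $line;
my $default_removed = chomp($default_copy);

# With $/ set to CR LF, chomp removes the whole two-character ending.
my ($crlf_copy, $crlf_removed);
{
    local $/ = "\x0D\x0A";
    $crlf_copy    = $line;
    $crlf_removed = chomp($crlf_copy);
}

print "default \$/ removed $default_removed character(s)\n";   # 1
print "CR LF \$/ removed $crlf_removed character(s)\n";        # 2
```

Note that this only works on data that still contains the CR, for example a file read after binmode.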