fxia has asked for the wisdom of the Perl Monks concerning the following question:

I've seen several articles mentioning the trick of setting $/ to "" to read paragraphs, since "" is a special case for $/ which essentially means "\n\n". But this does not work for DOS files, since there a blank line shows up as "\r\n\r\n". Can anyone share some wisdom on handling DOS files? Thanks.

Replies are listed 'Best First'.
Re: $/ and DOS files
by John M. Dlugosz (Monsignor) on Jul 21, 2001 at 02:17 UTC
    Even though the file contains "\r\n", the Perl script sees a plain "\n" when reading from the file, by default. Translation of end-of-line markers is done at a lower level, so no matter what OS you are on, you use "\n" in Perl.

    (to disable that, use binmode)
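
A small sketch of the translation described above, assuming a Perl with PerlIO layers (5.8 or later, which postdates this thread). The temp file and its contents are made up for illustration. On DOS/Windows builds the `:crlf` layer is what turns "\r\n" into "\n" by default; on Unix it is not applied unless you ask for it, so a DOS file read there still shows the "\r". Requesting the layer explicitly should also let paragraph mode work on such a file:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample: three paragraphs with DOS line endings,
# separated by one blank line, then two blank lines.
my $dos_text = "one\r\ntwo\r\n\r\nthree\r\n\r\n\r\nfour\r\n";

my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;                       # write the bytes exactly as given
print {$out} $dos_text;
close $out or die "close: $!";

# Read with no translation: the "\r" is visible to the script.
open my $raw, '<:raw', $path or die "open: $!";
my $first_raw = <$raw>;
close $raw;

# Read through the :crlf layer: "\r\n" arrives as "\n",
# so the magic paragraph mode ($/ = "") behaves as usual.
open my $fh, '<:crlf', $path or die "open: $!";
my @paragraphs = do {
    local $/ = "";
    my @p = <$fh>;
    chomp @p;                       # in paragraph mode, chomp strips all trailing newlines
    @p;
};
close $fh;
```

Here `$first_raw` still carries its "\r", while `@paragraphs` holds three clean records.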

      But my case is processing a DOS file on a UNIX system, so perl won't treat "\r\n" the same as "\n". If I truncate the "\r" first, it does work, but I think that is a performance hit. I just wonder if there is a better way to do it.
        As noted in another message, you can't compose a value for $/ that works like the magic built-in paragraph mode.

        So, you could implement a filter, and read from that filter. Or, slurp in the whole file and use split, which does allow regex for the delimiter. That is easy and speedy, if your file is small enough so memory is not an issue.

        Something like:

        my @paragraphs = split /(?:\r?\n){2,}/, do { local $/; <INPUT> };  # separators are not included, so the records are already chomped (only a single newline at EOF stays on the last one)
        —John
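
For files too large to slurp, the "filter" idea mentioned above might look something like the following. This is only a sketch: `read_paragraph` is a hypothetical helper name, and the sample text is made up. It reads line by line, accepts either "\n" or "\r\n" endings, and treats any run of blank lines as one paragraph break, similar in spirit to the built-in paragraph mode:

```perl
use strict;
use warnings;

# Hypothetical helper: return the next paragraph from $fh, or undef at EOF.
# Lines may end in "\n" or "\r\n"; one or more blank lines end a paragraph.
sub read_paragraph {
    my ($fh) = @_;
    local $/ = "\n";                   # read ordinary lines, whatever the caller set
    my @lines;
    while (defined(my $line = <$fh>)) {
        $line =~ s/\r?\n\z//;          # strip either style of line ending
        if ($line eq '') {
            last if @lines;            # blank line after text: paragraph done
            next;                      # leading blank lines are skipped
        }
        push @lines, $line;
    }
    return @lines ? join("\n", @lines) : undef;
}

# Demo on an in-memory filehandle holding made-up DOS-style text.
my $dos_text = "one\r\ntwo\r\n\r\n\r\nthree\r\n";
open my $in, '<', \$dos_text or die "open: $!";
my @paragraphs;
while (defined(my $para = read_paragraph($in))) {
    push @paragraphs, $para;
}
close $in;
```

Only one paragraph is held in memory at a time, at the cost of a regex substitution per line.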
Re: $/ and DOS files
by synapse0 (Pilgrim) on Jul 21, 2001 at 02:19 UTC
    You can set the $/ to be any delimiter you want. In your case, you probably want to use $/ = "\r\n\r\n"; (or whatever happens to be your delimiter of choice)
    -Syn0
      Not quite.

      The normal behavior of Perl (unless you use binmode) is to treat the end-of-line sequence as "\n" no matter what the real form on the platform is. So the input command won't see the "\r" to match it. If you were operating in binmode, you would indeed need to reset $/ to match.

      But the empty string has a special meaning: any number of consecutive blank lines counts as a single terminator. "\n\n" will blindly consume exactly two newlines, even if more blank lines follow. There is no way to set $/ to a normal (non-magic) value to accomplish the same thing, since it takes a literal string, not a regex.

      Perhaps that idea is outdated. Why not allow the record separator to be a regex or even a code ref, and eliminate the special built-in case?

      —John
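
To make the difference between the magic empty string and a literal "\n\n" concrete, here is a small sketch using an in-memory filehandle. The input text and the helper name `records_with` are invented for the demonstration:

```perl
use strict;
use warnings;

# Made-up input: two paragraphs separated by three blank lines.
my $text = "alpha\n\n\n\nbeta\n";

# Read every record from $text using the given value of $/.
sub records_with {
    my ($separator) = @_;
    open my $fh, '<', \$text or die "open: $!";
    local $/ = $separator;
    my @records = <$fh>;
    close $fh;
    return @records;
}

my @magic   = records_with("");      # paragraph mode: a run of blank lines is one break
my @literal = records_with("\n\n");  # literal separator: exactly two newlines each time

printf "paragraph mode: %d records, literal: %d records\n",
    scalar @magic, scalar @literal;
```

With the literal separator, the leftover blank lines turn into an extra record that contains nothing but "\n\n", which is the "blindly take two lines" behavior described above.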