rvosa has asked for the wisdom of the Perl Monks concerning the following question:

I am maintaining a script that does this:
    sub setLineBreak {
        my $inFile = shift;
        $/ = "\n";
        open( IN, "<$inFile" ) or die "Cannot open $inFile: $!\n";
        while (<IN>) {
            if ( $_ =~ /\r\n/ ) {
                print "DOS line breaks detected ...\n" if ($verbose);
                $/ = "\r\n";
                last;
            }
            elsif ( $_ =~ /\r/ ) {
                print "Mac line breaks detected ...\n" if ($verbose);
                $/ = "\r";
                last;
            }
            else {
                print "Unix line breaks detected ...\n" if ($verbose);
                $/ = "\n";
                last;
            }
        }
        close IN;
    }
The idea is that the input line separator needs to be detected and $/ is then adjusted accordingly. If all you do, subsequently, is open the file, and do:
    while (<FILE>) {
        chomp;
        # etc.
    }
then I don't think that whole $/ adjustment thing is necessary. I for one have never seen such a check before. I'm tempted to just remove the whole sub. Why shouldn't I? Thanks!

Re: alter $/ - but why?
by derby (Abbot) on Aug 03, 2005 at 18:11 UTC

    from perlport:
    When dealing with binary files (or text files in binary mode) be sure to explicitly set $/ to the appropriate value for your file format before using chomp().

    So if your script accepts files from different OSes and the newlines are not converted appropriately during the transfer, then you're going to have to set the input record separator explicitly.
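
To make the perlport point concrete, here is a minimal sketch of what chomp leaves behind when $/ does not match the file's terminator. The sample record is made up for illustration, and the octal escapes assume the data arrives as raw bytes (for example a DOS file read on a Unix box).

```perl
use strict;
use warnings;

# Hypothetical sample record: one line of a DOS-style file, seen as raw bytes.
my $dos_line = "first line\015\012";

# With $/ set to a bare LF (the Unix default), chomp strips only the LF,
# so a stray CR stays attached to the data.
my $default = $dos_line;
{
    local $/ = "\012";
    chomp $default;
}

# With $/ set to CRLF, chomp removes the whole terminator.
my $adjusted = $dos_line;
{
    local $/ = "\015\012";
    chomp $adjusted;
}

printf "LF as separator:   %d characters left\n", length $default;
printf "CRLF as separator: %d characters left\n", length $adjusted;
```

The first result is one character longer than the second because the CR is still there, which is the kind of silent breakage the sub in the question is trying to avoid.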

    -derby
Re: alter $/ - but why?
by radiantmatrix (Parson) on Aug 03, 2005 at 18:44 UTC
    derby points out the answer to your question, but there's another couple of pieces.
    1. If you want to process the file one line at a time, you need to set $/ anyway, or you may not be reading one line at a time.
    2. If you need to preserve the original line endings for a write-out operation at some point, you can just as easily modify the sub to set $\ as well.

    Because of the first, the whole structure is kind of odd anyway, since with Mac line endings, you'd slurp the whole file to find out that you have those line endings.

    Better to do:

    open IN, '<', $filename or die("Can't open $filename: $!");
    sysseek IN, -5, 2;    # 2 = seek relative to end of file
    my $last_five;
    sysread IN, $last_five, 5;
    ## find out what the EOL chars are and set $/ to match
    ## (the bare \r alternative covers Mac-style files)
    $/ = $1 if $last_five =~ m/(\r?\n|\r)\z/;
    sysseek IN, 0, 0;     # rewind before the line-by-line read
    while (<IN>) {
        chomp;
        # now process stuff
    }

    This is predicated on the text-files being well-formed (ending with an EOL before EOF), so you may need to handle the possibility of malformed files or whatnot.
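
One way to sidestep the well-formedness assumption is to look at the start of the file instead of the last five bytes. The sketch below is only that, a sketch: detect_eol and its chunk size are hypothetical names and values, not something from the code above, and it returns undef when the first chunk holds no terminator at all (for example one very long line).

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: guess the line terminator from the first chunk of a file.
# Returns "\015\012", "\012", "\015", or undef if no terminator is seen.
sub detect_eol {
    my ( $filename, $chunk_size ) = @_;
    $chunk_size ||= 4096;    # arbitrary default, an assumption

    open my $fh, '<', $filename or die "Can't open $filename: $!";
    binmode $fh;             # look at raw bytes, no newline translation
    my $read = read( $fh, my $chunk, $chunk_size );
    die "Can't read $filename: $!" unless defined $read;

    # A CR at the very end of the chunk may be half of a CRLF pair,
    # so pull in one more byte before deciding.
    if ( $chunk =~ /\015\z/ ) {
        read( $fh, my $next, 1 );
        $chunk .= $next if defined $next;
    }
    close $fh;

    # The first terminator found wins; CRLF is tried before a lone CR.
    return $chunk =~ /(\015\012|\012|\015)/ ? $1 : undef;
}

# Usage sketch: write a small CR-only file, then detect and apply its terminator.
my ( $out, $file ) = tempfile( UNLINK => 1 );
binmode $out;
print {$out} "alpha\015beta\015gamma";    # note: no terminator before EOF
close $out;

my $eol = detect_eol($file);
my @lines;
{
    local $/ = defined $eol ? $eol : "\012";
    open my $in, '<', $file or die "Can't open $file: $!";
    binmode $in;
    chomp( @lines = <$in> );
    close $in;
}
print scalar(@lines), " lines read\n";
```

Like the original sub, this trusts the first terminator it sees, so a file with mixed line endings will still be read inconsistently.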

    <-radiant.matrix->
    Larry Wall is Yoda: there is no try{} (ok, except in Perl6; way to ruin a joke, Larry! ;P)
    The Code that can be seen is not the true Code
    "In any sufficiently large group of people, most are idiots" - Kaa's Law
      Because of the first, the whole structure is kind of odd anyway, since with Mac line endings, you'd slurp the whole file to find out that you have those line endings.
      Actually, if you set $/ = "\n" (which is the default), you only have a problem if you read files created by another OS. "\n" isn't a fixed byte; it's the appropriate byte sequence for the OS.

      That's why print "$line\n" is the portable way of printing lines to (text) files, but an unportable way of writing to a socket for a protocol that uses CR/LF as its line terminator, as many popular protocols do. And printing "$line\r\n" is unportable as well; use print "$line\x0D\x0A" instead.
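
As a rough illustration of the explicit-bytes advice, the sketch below writes one CRLF-terminated line. An in-memory filehandle stands in for the socket, which is purely an assumption so the example runs without a network; the protocol line is made up.

```perl
use strict;
use warnings;

# Explicit byte values: CR is \015 (0x0D), LF is \012 (0x0A).
my $CRLF = "\015\012";

# Stand-in for a socket: an in-memory filehandle (assumption for this sketch).
my $wire = '';
open my $sock, '>', \$wire or die "Can't open in-memory handle: $!";
binmode $sock;    # make sure no newline translation happens on output
print {$sock} "HELO example.org", $CRLF;
close $sock;

# Show the last two bytes as decimal ordinals: 13 is CR, 10 is LF.
printf "%vd\n", substr( $wire, -2 );
```

Spelling the terminator as bytes means the output is the same regardless of what "\n" and "\r" happen to mean on the platform running the script.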

        If you are guaranteed to only read your OS's native formats, then you wouldn't need this routine at all. Therefore, I assumed the OP has this code in place because the script running on one OS is likely to read files created by several different OSes.

        So, I stand by my statement: if you read a file with Mac line endings (say, on a Unix box), the code in the top node would read the whole file as a single record, since $/ would be looking for a Unix-style line ending ("\n"), which a file using only "\r" never contains.

        Your point about using the hex values for setting is a good one to remember, but the code as I wrote it automatically accounts for that. As for using the same line endings for output as have been determined for input, $\ = $/ is sufficient.

        <-radiant.matrix->
Re: alter $/ - but why?
by betterworld (Curate) on Aug 03, 2005 at 18:36 UTC
    To make your code even more portable, you should replace every "\n" by "\012" (except those "\n" that are printed).

    If you don't, your code would not run properly on Mac systems.

      Well, yes and no... actually the tests for bare "\n" and "\r" would work, but the test for "\r\n" would fail (at least, in MacPerl on MacOS Classic). It is advisable to replace "\r\n" and /\r\n/ with "\015\012" and /\015\012/, respectively.
Re: alter $/ - but why?
by graff (Chancellor) on Aug 04, 2005 at 03:02 UTC
    As indicated previously, "chomp" is roughly equivalent to "s{\Q$/\E\z}{}" on a string, so if you're going to use it on files of unknown origin (with line endings varying from file to file), it would be a good idea to make sure that $/ is set appropriately for each file.

    But that sub does have its drawbacks: apart from the fact that it will pull in the full content of a "\r-only" type of text file, there is also the possibility that a single file could contain a variety of patterns involving "\r" and "\n" -- e.g. someone on a unix box quickly edits a CRLF-type file, adding a couple of "\n-only" lines at the top, or the file contains stuff other than text, etc.

    If the goal is simply to be able to handle all sorts of line-termination patterns (and you aren't worried about getting hit with a massive Mac "\r-only" file that'll chew up too much RAM), you could do without the sub and go right to a main processing loop like this:

    $/ = "\xa";
    while ( <FILE> ) {
        s/\xd?\xa$//;    # does what chomp would do, handles CRLF and LF-only
        for my $line ( split /\xd/, $_, -1 )    # handles CR-only cases
        {
            # now we're line-oriented no matter what the input style is...
        }
    }
    OTOH, if the goal is to be scrupulous and careful about knowing what sorts of line termination are showing up in your data files, write a separate diagnostic for that, have it produce a suitably detailed report for each file (e.g. number of "(\r\n)+", number of "(\n)+", number of "(\r)+"), and then configure your data-processing script(s) to work from that report.
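
A separate diagnostic along those lines could start out like the sketch below. The sub name count_line_endings is hypothetical, it counts individual terminators rather than runs of them, and it slurps the file for brevity, so it is a starting point under those assumptions rather than something ready for huge inputs.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical diagnostic: count each kind of line terminator in a file.
sub count_line_endings {
    my ($filename) = @_;
    open my $fh, '<', $filename or die "Can't open $filename: $!";
    binmode $fh;                          # raw bytes, no newline translation
    my $data = do { local $/; <$fh> };    # slurp (simplification)
    close $fh;
    $data = '' unless defined $data;

    my %count = ( CRLF => 0, LF => 0, CR => 0 );
    # CRLF is tried first so its two bytes are not counted separately.
    while ( $data =~ /(\015\012|\012|\015)/g ) {
        my $eol = $1;
        my $kind = $eol eq "\015\012" ? 'CRLF'
                 : $eol eq "\012"     ? 'LF'
                 :                      'CR';
        $count{$kind}++;
    }
    return \%count;
}

# Usage sketch: a deliberately mixed file, as in the "quick edit" scenario.
my ( $out, $file ) = tempfile( UNLINK => 1 );
binmode $out;
print {$out} "unix\012dos\015\012dos again\015\012mac\015";
close $out;

my $report = count_line_endings($file);
printf "%-4s %d\n", $_, $report->{$_} for qw(CRLF LF CR);
```

A file where more than one counter is non-zero is exactly the mixed case that a single $/ setting cannot handle.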
Re: alter $/ - but why?
by samtregar (Abbot) on Aug 03, 2005 at 20:50 UTC
    Have you tried it? I recommend you write three tests - one with source text with each line-ending style. Verify that it works correctly with the original code. Then make your change and see if it still works. If it does, you're done. If not you'll need to learn more about what chomp() does.

    For bonus points, write your tests using Test::More!
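
Such a test file might look roughly like this. The read_lines sub is a hypothetical stand-in for the script's own read loop, and the fixture contents are made up; in practice you would call the real code before and after removing setLineBreak.

```perl
use strict;
use warnings;
use Test::More tests => 3;
use File::Temp qw(tempfile);

# Hypothetical stand-in for the script's read loop: reads $file using the
# given record separator and returns the chomped lines.
sub read_lines {
    my ( $file, $separator ) = @_;
    local $/ = $separator;
    open my $fh, '<', $file or die "Can't open $file: $!";
    binmode $fh;
    chomp( my @lines = <$fh> );
    close $fh;
    return @lines;
}

# One fixture per line-ending style, written as explicit bytes.
my %style = ( unix => "\012", dos => "\015\012", mac => "\015" );

for my $name ( sort keys %style ) {
    my $eol = $style{$name};
    my ( $out, $file ) = tempfile( UNLINK => 1 );
    binmode $out;
    print {$out} join( '', map { "$_$eol" } qw(one two three) );
    close $out;

    is_deeply(
        [ read_lines( $file, $eol ) ],
        [qw(one two three)],
        "$name line endings"
    );
}
```

Passing a fixed "\012" as the separator for all three fixtures instead should make the dos and mac cases fail, which shows what removing the detection sub would change.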

    -sam