
Dealing with files with differing line endings

by dd-b (Monk)
on Nov 05, 2021 at 19:35 UTC ( [id://11138481] : perlquestion )

dd-b has asked for the wisdom of the Perl Monks concerning the following question:

Because of multiple systems (with different OSs) sharing files on a file-server, I find my code running into "text files" with different line endings from the system-default for the system they are running on. I can't predict the line endings in advance (can't deduce it reliably from filename, extension, directory, or whatever).

The files are all small by today's standards, thousands of characters not billions, and in these applications I'm not worried they'll ever challenge memory size (so the idea below, which slurps the whole file first, doesn't look risky for this particular application).

Since $/ is a simple string (not a regexp), I can't just put a suitable regexp there and get the file split into lines on any standard terminator. (I've got DOS, FreeBSD, and Unix, and possibly Mac, interpretations of text files to cope with.)

Is there a consensus best-practice for this?

I'm currently thinking of slurping the whole file and then breaking it down in my code, unless there's already a module that handles this cleanly?
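A minimal sketch of that slurp-and-split idea, using Perl's \R escape (available since 5.10), which matches LF, CRLF, or a lone CR. The data here is an illustrative assumption; normally $data would be slurped from the file in binmode (e.g. open my $fh, '<:raw', $path; my $data = do { local $/; <$fh> };):

```perl
use strict;
use warnings;

# Illustrative stand-in for a file slurped in binmode.
my $data = "dos line\r\nunix line\nold-mac line\rlast line";

# \R matches LF, CRLF, or a lone CR, so one split handles all three styles.
my @lines = split /\R/, $data;
print scalar(@lines), "\n";   # 4
```

Because split consumes the terminators, the resulting lines need no chomp afterwards.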

Replies are listed 'Best First'.
Re: Dealing with files with differing line endings
by stevieb (Canon) on Nov 05, 2021 at 21:17 UTC

    My File::Edit::Portable was written to deal with this exact situation.

    Get a file handle of the file with the record separators changed to that of the local platform, make changes, and write it back to the same file with the original record separators:

    use File::Edit::Portable;

    my $rw = File::Edit::Portable->new;
    my $fh = $rw->read('file.txt');

    ...

    $rw->write(contents => $fh);

    Get an array of a file's contents with the line endings stripped off (one line per element), make changes, and write the data back to the original file (the original line endings will be preserved and put back into place automagically):

    my @contents = $rw->read('file.txt');

    for (@contents) {
        ...
    }

    $rw->write(contents => \@contents);

    There's a myriad of other magic you can do as well, like automatically making a backup copy of each file, changing line endings, using custom line endings, checking which endings a file is using, splicing stuff into the files, etc.

      Was guessing I wasn't the first person to have something like this problem! Thanks for pointing out your module.
Re: Dealing with files with differing line endings
by ikegami (Patriarch) on Nov 05, 2021 at 20:07 UTC

    All the systems you mentioned use CR LF or LF (unless you meant the ancient MacOS which used CR).

    So just use LF as the line terminator as usual, but use something like s/\s+\z// instead of chomp.

    while (<>) {
        s/\s+\z//;
        ...
    }

    Alternatively, you could add a :crlf layer to the handle.

    open(my $fh, '<:crlf', $qfn)
        or die("Can't open \"$qfn\": $!\n");

    while (<$fh>) {
        chomp;
        ...
    }

    This already happens by default on Windows, which is why it can handle the listed file formats naturally.

      Good point! I was jumping back to a more general question than I need to solve. As you say, I can just force LF for line boundaries. Parsing the contents can handle the various line separators with \R (I think it already does), or I could write my own chomp with a suitable regexp that kills all kinds of line terminators.

        As you say, I can just force LF for line boundaries

        No need to force anything. $/ is already a LF on all systems except ancient MacOS. Just replace chomp; with s/\s+\z//;.

Re: Dealing with files with differing line endings
by LanX (Saint) on Nov 05, 2021 at 20:03 UTC
    I always thought chomp handles that.

    Could you provide us with an example which goes wrong?

    Possible solutions (if needed):

    • replace chomp with a regex in your code
    • override chomp with your own version in legacy code.

    Could it be you are not using chomp at all, but setting $/ to get rid of the line-endings?

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      Here is an example of overriding chomp:

      use strict;
      use warnings;

      package NewChomp;

      use Data::Dump qw/pp dd/;
      use subs qw/chomp/;

      sub chomp {
          $_[0] =~ s/\n$//;    # adjust here
      }

      pp my $line = "abcd\n";
      chomp $line;
      pp $line;

      Just export it from a new module into your scripts, and adjust the regex to your needs.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

      chomp cleans off the end of a string based on the current value of $/. I need something that makes reading the next line of the file terminate in the correct place. (And then I probably also need to do something like chomp, but that's easy.)
Re: Dealing with files with differing line endings
by BillKSmith (Monsignor) on Nov 06, 2021 at 03:01 UTC
    A general solution is impossible. Any file can contain normal text characters that another OS would interpret as line separators. You may be able to assume that this will never happen with your data. Your idea of slurping the entire file (in binmode) into a string is probably the safest. Use anything you know about the file (line length, number of lines, words that only occur at the start or end of a line, etc) to determine which kind of file it is. Open the string as a memory file with the appropriate IO layer. You could then use the <> operator exactly as you normally would.
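A hedged sketch of that in-memory-file approach; the sample data and the ending-detection heuristic below are illustrative assumptions, not a robust detector:

```perl
use strict;
use warnings;

# Stand-in for a file slurped in binmode elsewhere.
my $data = "alpha\r\nbeta\r\ngamma\r\n";

# Crude guess: if CRLF pairs are present, read through the :crlf layer.
my $layer = $data =~ /\r\n/ ? '<:crlf' : '<';

# Open the string as an in-memory file with the chosen IO layer,
# then use readline/chomp exactly as with a real file.
open my $fh, $layer, \$data or die "Can't open in-memory file: $!";
my @clean;
while ( my $line = <$fh> ) {
    chomp $line;
    push @clean, $line;
}
print "@clean\n";   # alpha beta gamma
```

Any knowledge about the file (expected line lengths, sentinel words, etc.) can go into choosing $layer; the regex above is only the simplest possible check.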
      We may be overthinking this. ikegami's solution should be fine. The exception is that ancient Mac which uses <CR> instead of <CR><LF> or <LF> for line endings. One of my users was using an old Mac to edit one of my config files and reported that my config file "didn't work". I talked with this guy and told him to set his text editor to "write DOS compatible files" and that ended the problem. Modern Macs use <LF>. Unless there is a specific strange requirement, writing code to handle ancient Mac is not worth the effort.
        As a practical matter, I am sure that you are right. However, it is important to know that there are corner cases. Consider the following contrived example.
        use strict;
        use warnings;
        use Test::More tests => 1;

        my $file = \do{ "This \n is not the end of a line on windows\r\n" };

        open my $fh1, '<:raw', $file;
        my $chars_read = length(<$fh1>);
        close $fh1;

        my $chars_expected = 47;
        is( $chars_read, $chars_expected, 'record length' );


        1..1
        not ok 1 - record length
        #   Failed test 'record length'
        #   at line 15.
        #          got: '6'
        #     expected: '47'
        # Looks like you failed 1 test of 1.

        Unfortunately, my solution (use :crlf instead of :raw) does not work either.

Re: Dealing with files with differing line endings
by Anonymous Monk on Nov 06, 2021 at 14:34 UTC

    PerlIO::eol has not been updated in a while, but the last time I tried it still worked, and it installs successfully under Perl 5.34.0.
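Assuming the module is installed, its SYNOPSIS-style usage looks like this; the in-memory file below is only to keep the sketch self-contained:

```perl
use strict;
use warnings;
use PerlIO::eol;   # makes the :eol(...) layer available

# The :eol(LF) layer normalizes every CR, LF, or CRLF on input to LF,
# so ordinary readline/chomp behave the same for any source platform.
open my $fh, '<:raw:eol(LF)', \"one\r\ntwo\rthree\n" or die $!;
my @lines;
while (<$fh>) {
    chomp;
    push @lines, $_;
}
print "@lines\n";   # one two three
```

Unlike the s/\s+\z// trick, this preserves interior whitespace and keeps chomp usable as-is.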