http://qs1969.pair.com?node_id=1118313

barryghall has asked for the wisdom of the Perl Monks concerning the following question:

I need to determine the kind of line ending (Windows, Unix, or Mac) used in each of a set of input files and, based on the kind of line endings, convert the line endings of the file to Unix.

Assume that the array @Files holds the paths to the files of interest

The pseudocode would be:

    for each file $F in @Files
        if $F is a text file:
            my $LE = getEndings($F)   # $LE is the endings type: Windows, Unix, or Mac
            if ($LE eq 'Windows' or $LE eq 'Mac') { change $F to Unix line endings }

Here getEndings is a subroutine that determines the line-ending type.

The files of interest can be huge, so I can't expect to slurp the entire file contents into memory. Instead, to convert a Windows or Mac file I expect to read the file line by line, chomp each line, then write that line with the usual Unix line ending to a file named temp.txt. After closing temp.txt I would remove the original file and rename temp.txt to the original file name.

I have not been able to figure out how to write a getEndings subroutine that works with all three kinds of line endings. Any suggestions will be much appreciated.

The notion of reading files line by line and writing lines to temp.txt should work for files with Windows endings, but with Mac line endings the lines would not be read correctly. Any ideas?
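
To make the intent concrete, here is a minimal sketch of what I have in mind. The names getEndings and convert_to_unix are placeholders of my own, and the detection assumes the first 64 KB of the file is representative; setting $/ to CR lets a Mac-style file still be read record by record, so nothing huge is slurped into memory:

    use strict;
    use warnings;
    use File::Copy qw(move);

    # Guess the line-ending style from the first chunk of the file.
    # Assumption: a 64 KB sample is representative of the whole file.
    sub getEndings {
        my ($path) = @_;
        open my $fh, '<:raw', $path or die "can't open $path: $!";
        my $chunk = '';
        read $fh, $chunk, 65536;
        close $fh;
        return 'Windows' if $chunk =~ /\x0d\x0a/;
        return 'Mac'     if $chunk =~ /\x0d/;
        return 'Unix';
    }

    # Rewrite $path with Unix (LF) endings via temp.txt, reading one
    # record at a time. Setting $/ to CR lets a Mac file be read
    # "line by line" too.
    sub convert_to_unix {
        my ($path, $endings) = @_;
        local $/ = $endings eq 'Mac' ? "\x0d" : "\x0a";
        open my $in,  '<:raw', $path      or die "can't read $path: $!";
        open my $out, '>:raw', 'temp.txt' or die "can't write temp.txt: $!";
        while ( my $line = <$in> ) {
            $line =~ s/\x0d?\x0a?\z//;    # strip CR, LF, or CRLF
            print $out "$line\x0a";
        }
        close $in;
        close $out;
        move( 'temp.txt', $path ) or die "can't replace $path: $!";
    }

    my @Files = @ARGV;                    # paths to the files of interest
    for my $F (@Files) {
        next unless -T $F;                # Perl's built-in text-file heuristic
        my $LE = getEndings($F);
        convert_to_unix( $F, $LE ) if $LE eq 'Windows' or $LE eq 'Mac';
    }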

Replies are listed 'Best First'.
Re: How to determine type of line endings in a text file from within a script
by graff (Chancellor) on Mar 02, 2015 at 00:46 UTC
    Do you have some sort of criterion for deciding whether or not a given file is a "text file"?

    A general solution that I've used is to scan each file for all occurrences of "\x0a" and "\x0d" (LF and CR), and report their statistics in a useful way. I happen to have written a script some years ago to do just this, so I'll post it here:

    #!/usr/bin/perl
    use strict;
    use warnings;

    die "Usage: $0 filename [filename ...]\n" unless @ARGV and -f $ARGV[0];

    for my $file ( @ARGV ) {
        my ( $cr, $lf, $crlf ) = ( 0 ) x 3;
        unless ( open I, $file ) {
            warn "can't open $file: $!\n";
            next;
        }
        binmode I;  ## (added as an update)
        $_ = " ";
        while ( read I, $_, 65536, 1 ) {
            $lf += tr/\x0a/\x0a/;
            $cr += tr/\x0d/\x0d/;
            $crlf += s/\x0d\x0a/xx/g;
            $_ = chop;
            $cr-- if ( $_ eq "\x0d" );  # a final CR or LF will get counted
            $lf-- if ( $_ eq "\x0a" );  # again on the next iteration
        }
        $cr++ if ( $_ eq "\x0d" );
        $lf++ if ( $_ eq "\x0a" );
        print "$file: $cr CR, $lf LF, $crlf CRLF\n";
    }

    =head1 NAME

    chk-crlf

    =head1 SYNOPSIS

    chk-crlf filename [filename ...]

    =head1 DESCRIPTION

    This program will read through one or more files named on the command
    line, and for each one, it prints to STDOUT a one-line report showing
    the total quantities of carriage-return (CR) and line-feed (LF) bytes,
    along with the number of byte pairs that are CRLF sequences, like this:

        unix-file1.txt: 0 CR, 80 LF, 0 CRLF
        dos-file1.txt: 80 CR, 80 LF, 80 CRLF
        binary-file.gz: 31 CR, 28 LF, 2 CRLF

    This is handy for confirming any expectations you may have about the
    nature of the file's content regarding line-termination characters.

    =head2 Valid outcomes

    If the three quantities are all equal, you have a valid MS-DOS "text
    mode" (or internet format) file: every CR and LF in the data is part
    of a CRLF pair.

    If the number of CR bytes (and CRLF pairs) is zero, you have a valid
    "unix style" text file.

    If there are slightly different quantities of CR and LF, and very few
    CRLF pairs, you are probably looking at non-text data (e.g. audio,
    image, or some form of compressed data). This in itself is not a
    problem, if the file is supposed to have non-text content.

    =head2 Not-so-valid outcomes

    If there are more LF's than CR's, but all the CR's are involved in
    CRLF pairs (CR < LF, CR == CRLF), you probably have a "hybrid" text
    file: a unix system created some of the lines, and incorporated lines
    from some MS-DOS-like source without normalizing the line termination.
    This might not be a problem, but you may want to make the line
    termination consistent to avoid problems for some kinds of processing.

    If there are more CR's than LF's (e.g. roughly twice as many), but all
    the LF's are involved in CRLF pairs (CR > LF, LF == CRLF), you might
    be looking at a file that is supposed to have non-text content, but
    has gone through a 'unix2dos' text-mode conversion, whereby all LF
    bytes (or all that were not originally preceded by CR) have been
    replaced by CRLF byte pairs. (Or you might be looking at a DOS-like
    text file that happens to have extra CR characters embedded in some
    of the lines.)

    Usually, any sort of non-text file that has been through a unix2dos
    text mode conversion is hopelessly corrupted and unusable -- there may
    be no way of undoing the alteration, because it may be impossible to
    know which LF characters were preceded by CR in the original
    (uncorrupted) version of the data (as opposed to having a CR inserted
    by the conversion). If you can, try to find a prior version of the
    file that has not been affected by the conversion.

    =cut
    UPDATE: I added binmode I; -- the original script had been written for use on unix/linux, but I presume the binmode call would be needed if running under ms-windows (which I don't use).

    ANOTHER UPDATE: The pod above doesn't mention this (and maybe there's a diminishing need to mention it), but there's one other distinctive outcome that could show up: CR=LF, but CRLF=0. This is what you'd get from a "text" file that contains regular CRLF line terminations, but is encoded as UTF16 (whether big- or little-endian). There's also still some chance of seeing CR>0 and LF=0 (old-style Macintosh line terminations).
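
    If you want to fold those counts into the single verdict the OP asked for, something like this would do. This is a hedged sketch, not part of the script above: classify_endings and its labels are my own naming, and the catch-all case deliberately punts on hybrid or binary files:

        # Map the (CR, LF, CRLF) counts to a line-ending verdict.
        sub classify_endings {
            my ( $cr, $lf, $crlf ) = @_;
            return 'Unix'    if $cr == 0 && $lf > 0;    # LF only
            return 'Mac'     if $lf == 0 && $cr > 0;    # CR only (old-style Macintosh)
            return 'Windows' if $cr == $lf && $lf == $crlf && $crlf > 0;
                                                        # every CR and LF in a CRLF pair
            return 'mixed-or-binary';                   # anything else needs a closer look
        }

    Note that the UTF-16 case mentioned above (CR=LF, CRLF=0) falls through to 'mixed-or-binary' here, which is probably the safest default.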

Re: How to determine type of line endings in a text file from within a script
by Anonymous Monk on Mar 01, 2015 at 22:00 UTC