Do you have some sort of criterion for deciding whether or not a given file is a "text file"?
A general solution that I've used is to scan each file for all occurrences of "\x0a" and "\x0d" (LF and CR), and report their statistics in a useful way. I happen to have written a script some years ago to do just this, so I'll post it here:
#!/usr/bin/perl
use strict;
use warnings;

die "Usage: $0 filename [filename ...]\n"
    unless @ARGV and -f $ARGV[0];

for my $file ( @ARGV ) {
    my ( $cr, $lf, $crlf ) = ( 0 ) x 3;
    unless ( open I, $file ) {
        warn "can't open $file: $!\n";
        next;
    }
    binmode I;    ## (added as an update)

    # Seed $_ with one byte; read() below fills at offset 1, so the last
    # byte of the previous chunk is carried over, and a CRLF pair that
    # straddles a chunk boundary is still seen by the s/// below.
    $_ = " ";
    while ( read I, $_, 65536, 1 ) {
        $lf   += tr/\x0a/\x0a/;       # count LF bytes
        $cr   += tr/\x0d/\x0d/;       # count CR bytes
        $crlf += s/\x0d\x0a/xx/g;     # count (and mask) CRLF pairs
        $_ = chop;                    # keep only the chunk's last byte
        $cr-- if ( $_ eq "\x0d" );    # a final CR or LF will get counted
        $lf-- if ( $_ eq "\x0a" );    # again on the next iteration
    }
    # if the file's very last byte was a CR or LF, it was decremented
    # above but never re-counted, so add it back
    $cr++ if ( $_ eq "\x0d" );
    $lf++ if ( $_ eq "\x0a" );
    print "$file: $cr CR, $lf LF, $crlf CRLF\n";
}

=head1 NAME

chk-crlf - report CR, LF and CRLF counts for one or more files

=head1 SYNOPSIS

chk-crlf filename [filename ...]

=head1 DESCRIPTION

This program will read through one or more files named on the command
line, and for each one, it prints to STDOUT a one-line report showing
the total quantities of carriage-return (CR) and line-feed (LF) bytes,
along with the number of byte pairs that are CRLF sequences, like
this:

 unix-file1.txt: 0 CR, 80 LF, 0 CRLF
 dos-file1.txt: 80 CR, 80 LF, 80 CRLF
 binary-file.gz: 31 CR, 28 LF, 2 CRLF

This is handy for confirming any expectations you may have about the
line-termination characters in a file's content.

=head2 Valid outcomes

If the three quantities are all equal, you have a valid MS-DOS "text
mode" (or internet format) file: every CR and LF in the data is part
of a CRLF pair. If the number of CR bytes (and CRLF pairs) is zero,
you have a valid "unix style" text file.

If there are slightly different quantities of CR and LF, and very few
CRLF pairs, you are probably looking at non-text data (e.g. audio,
image, or some form of compressed data). This in itself is not a
problem, if the file is supposed to have non-text content.

=head2 Not-so-valid outcomes

If there are more LF's than CR's, but all the CR's are involved in
CRLF pairs (CR < LF, CR == CRLF), you probably have a "hybrid" text
file: a unix system created some of the lines, and incorporated lines
from some MS-DOS-like source without normalizing the line termination.
This might not be a problem, but you may want to make the line
termination consistent to avoid problems for some kinds of processing.

If there are more CR's than LF's (e.g. roughly twice as many), but all
the LF's are involved in CRLF pairs (CR > LF, LF == CRLF), you might
be looking at a file that is supposed to have non-text content, but
has gone through a 'unix2dos' text-mode conversion, whereby all LF
bytes (or all that were not originally preceded by CR) have been
replaced by CRLF byte pairs. (Or you might be looking at a DOS-like
text file that happens to have extra CR characters embedded in some of
the lines.) A small demonstration of this conversion follows after the
script.

Usually, any sort of non-text file that has been through a unix2dos
text-mode conversion is hopelessly corrupted and unusable -- there may
be no way of undoing the alteration, because it may be impossible to
know which LF characters were preceded by CR in the original
(uncorrupted) version of the data (as opposed to having a CR inserted
by the conversion). If you can, try to find a prior version of the
file that has not been affected by the conversion.

=cut
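By the way, if you want to see that unix2dos failure mode in miniature, here's a rough sketch of what a text-mode conversion does to arbitrary bytes. The sample data and the regex are mine, and real conversion tools may differ in detail:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Simulate a unix2dos-style text-mode conversion on arbitrary bytes:
    # every LF that is not already preceded by CR gets a CR inserted.
    my $data = "\x01\x0a\x0d\x02\x0d\x0a\x03\x0a";    # pretend binary content
    ( my $damaged = $data ) =~ s/(?<!\x0d)\x0a/\x0d\x0a/g;

    # Undoing this would mean deleting only the inserted CRs, but nothing
    # in the damaged data says which CRs were inserted and which were
    # original, so the conversion is effectively irreversible for
    # non-text data.
    printf "original: %d bytes, after conversion: %d bytes\n",
        length($data), length($damaged);

Running the script above on the converted data would report 4 CR, 3 LF, 3 CRLF: more CRs than LFs, with every LF in a CRLF pair, which is exactly the suspicious pattern described in the pod.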
UPDATE: I added binmode I; to the script above. The original had been written for use on unix/linux, but I presume the binmode call would be needed when running under MS Windows (which I don't use).
ANOTHER UPDATE: The pod above doesn't mention this (and maybe there's a diminishing need to mention it), but there's one other distinctive outcome that could show up: CR=LF, but CRLF=0. This is what you'd get from a "text" file that contains regular CRLF line terminations but is encoded as UTF-16 (whether big- or little-endian): every CR and LF byte is paired with a NUL byte, so the two are never adjacent. There's also still some chance of seeing CR>0 and LF=0 (old-style Macintosh line terminations).
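To illustrate that UTF-16 case, here's a small sketch (the example string is mine) showing why the CRLF count comes out zero even though the file is, logically, CRLF-terminated text:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw(encode);

    # One CRLF-terminated line, encoded as UTF-16LE: each character
    # becomes two bytes, so the 0x0d and 0x0a bytes are separated by
    # 0x00 bytes and never form an adjacent CRLF pair.
    my $bytes = encode( 'UTF-16LE', "hello\r\n" );

    my $cr   = ( $bytes =~ tr/\x0d/\x0d/ );
    my $lf   = ( $bytes =~ tr/\x0a/\x0a/ );
    my $crlf = () = $bytes =~ /\x0d\x0a/g;

    print "$cr CR, $lf LF, $crlf CRLF\n";    # prints: 1 CR, 1 LF, 0 CRLF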