If a data file is supposed to have non-text content (audio, image, compressed data, etc.), you should expect to see some 0x0A bytes that are not preceded by 0x0D. But if some jerk passes the file through a "unix2dos" conversion (or a text-mode ftp transfer to a Windows machine, which amounts to the same thing), every 0x0A (ASCII LF) will end up preceded by 0x0D (ASCII CR). Here's a little test to check for that.
#!/usr/bin/perl
use strict;

die "$0 file2chk\n" unless @ARGV == 1 and -f $ARGV[0];

$/ = "\x0a";    # make sure we're "platform independent"!
open( I, $ARGV[0] ) or die "$ARGV[0]: $!";
binmode I;      # raw bytes -- don't let a CRLF-aware layer rewrite them

my $crlf = 0;
while (<I>) {
    if ( "\x0a" ne chop ) {    # $. is incremented for the last record,
        $.--;                  # even when the file does not end with \x0a
    }
    elsif ( "\x0d" eq chop ) {
        $crlf++;
    }
}
print "$ARGV[0] : $. \\x0A, $crlf CRLF\n";
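A sample run (the file name and the counts here are invented, just to show the shape of the output):

    $ perl chk_crlf.pl suspect.wav
    suspect.wav : 212 \x0A, 212 CRLF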
The point is that if you run this on something like a *.wav or *.gz or *.zip file, etc., and the number of times you see "0x0A" is exactly equal to the number of times you see "CRLF", then it's extremely likely that this file has gone through a unix2dos conversion -- which would explain why playback has those annoying clicks, or why decompression fails, etc. -- and you should probably just delete it (don't even try to fix it).
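In case it isn't obvious what the conversion physically does to the bytes, here's a minimal sketch with made-up data showing why the payload gets ruined:

#!/usr/bin/perl
use strict;
use warnings;

# Made-up "binary" payload: five bytes, two of which happen to be 0x0A.
my $raw = "\x01\x0a\x02\x0a\x03";

# What a unix2dos-style pass does to it: a CR is stuck in front of every LF.
( my $mangled = $raw ) =~ s/\x0a/\x0d\x0a/g;

printf "before: %d bytes, after: %d bytes\n", length $raw, length $mangled;
# before: 5 bytes, after: 7 bytes -- each inserted 0x0D corrupts the data stream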

Re: Check for "unix2dos" (CRLF) in binary files
by Aristotle (Chancellor) on Sep 17, 2004 at 07:19 UTC
    #!/usr/bin/perl
    use strict;
    use warnings;

    for my $file ( @ARGV ) {
        my ( $lf, $crlf ) = ( 0 ) x 2;
        open my $fh, '<', $file or die "open $file: $!\n";
        binmode $fh;    # raw bytes, no CRLF translation layer
        local $_ = " ";
        # Read 64k chunks at offset 1, so the last byte of the previous chunk
        # stays at the front and pairs split across two reads still match.
        while ( read $fh, $_, 65536, 1 ) {
            $lf   += @{ [ /(?<!\x0d)\x0a(?!\x0d)/g ] };
            $crlf += @{ [ /\x0a\x0d/g, /\x0d\x0a/g ] };
            $_ = chop;    # carry the last byte into the next read
        }
        print "$file: $lf LF, $crlf CRLF\n";
    }

    Makeshifts last the longest.

      I wouldn't have thought it was necessary to look for "LFCR" as well as "CRLF" -- it seems to me the "\x0d" always comes first in the pair, and I don't recall ever seeing it the other way around. (I wonder if/when we'll start seeing a UTF-16 version of unix2dos... heaven help us.)

      Apart from that, using a fixed-length read certainly is a good idea, for cases when files are really big and 0x0A's happen to be really few and far between (or non-existent). And your use of "$_ = chop" to cover the buffer edges is a nice trick. Thanks!
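      If I follow correctly, the reason that trick matters is that a CRLF pair can straddle two reads; here's a tiny illustration (the buffers and counts are made up) of the difference it makes:

      my @buffers = ( "data\x0d", "\x0adata" );    # CR ends one read, LF starts the next

      # Scanning each buffer in isolation never sees the pair:
      my $naive = grep { /\x0d\x0a/ } @buffers;    # 0

      # Carrying the last byte of the previous read forward (which is what the
      # offset-1 read plus "$_ = chop" achieves) does see it:
      my $carry = " ";
      my $crlf  = 0;
      for my $buf (@buffers) {
          my $chunk = $carry . $buf;
          $crlf += () = $chunk =~ /\x0d\x0a/g;
          $carry = substr $chunk, -1;               # keep only the last byte
      }
      print "isolated: $naive, carried: $crlf\n";   # isolated: 0, carried: 1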

        I think I've seen it the other way around once or twice. It's not particularly costly to lump that in, since the regex engine will short-circuit to a very fast simple search for fixed strings, but I guess I'm just paranoid.

        The fixed record read is almost certainly a win, even with frequent LFs. In a 64k file full of LFs, the while loop will only iterate once, as opposed to 65536 times. All actual iteration is implicit in the regex engine, which is much faster. Now if there were a way to just ask for the number of matches without storing them anywhere, that would be even better. Maybe this does the trick:

        $lf   += s/(?<!\x0d)\x0a(?!\x0d)/x/g;
        $crlf += s/\x0a\x0d/xx/g + s/\x0d\x0a/xx/g;

        I'm hoping here that replacing with a same-length string will keep the engine from wasting too much effort shuffling the string guts in memory. If it's slower than the match+capture method, one could at least avoid the overhead of constantly setting up and tearing down anonymous arrays by changing the relevant statements to

        $lf   += @match = /(?<!\x0d)\x0a(?!\x0d)/g;
        $crlf += @match = ( /\x0a\x0d/g, /\x0d\x0a/g );

        and declaring @match just once at the top of the script.

        This would require some solid benchmarking on a bunch of diverse data to make any calls.
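        Something like this would do for a first pass (untested sketch; "sample.bin" is a stand-in for whatever real-world data gets benchmarked, and each candidate works on its own copy since the s///-based one clobbers the buffer). The third candidate uses the "() =" trick, which I believe is the usual way to get just a match count without keeping the matches:

        use strict;
        use warnings;
        use Benchmark qw( cmpthese );

        # Slurp the sample file raw; "sample.bin" is just a placeholder name.
        my $data = do {
            open my $fh, '<', 'sample.bin' or die $!;
            binmode $fh;
            local $/;
            <$fh>;
        };

        cmpthese( -5, {
            capture => sub {
                my $buf  = $data;
                my $lf   = @{ [ $buf =~ /(?<!\x0d)\x0a(?!\x0d)/g ] };
                my $crlf = @{ [ $buf =~ /\x0a\x0d/g, $buf =~ /\x0d\x0a/g ] };
            },
            subst => sub {
                my $buf  = $data;
                my $lf   = $buf =~ s/(?<!\x0d)\x0a(?!\x0d)/x/g;
                my $crlf = ( $buf =~ s/\x0a\x0d/xx/g ) + ( $buf =~ s/\x0d\x0a/xx/g );
            },
            count => sub {
                my $buf  = $data;
                my $lf   = () = $buf =~ /(?<!\x0d)\x0a(?!\x0d)/g;
                my $crlf = () = ( $buf =~ /\x0a\x0d/g, $buf =~ /\x0d\x0a/g );
            },
        } );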

        Makeshifts last the longest.

        IIRC, the old Mac OS (pre-10) text files used a bare CR (0x0D) as the line ending, rather than any LF-first pairing.

        No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1