in reply to Check for "unix2dos" (CRLF) in binary files

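#!/usr/bin/perl
use strict;
use warnings;

for my $file ( @ARGV ) {
    my ( $lf, $crlf ) = ( 0 ) x 2;
    open my $fh, '<', $file or die "open $file: $!\n";
    binmode $fh;    # raw bytes, so CRLF is not translated on Windows
    local $_ = " "; # one-character pad; read() below fills in from offset 1
    while ( read $fh, $_, 65536, 1 ) {
        # count bare LFs, then LFCR and CRLF pairs
        $lf   += @{ [ /(?<!\x0d)\x0a(?!\x0d)/g ] };
        $crlf += @{ [ /\x0a\x0d/g, /\x0d\x0a/g ] };
        $_ = chop;  # carry the last byte over so pairs split across reads are seen
    }
    print "$file: $lf LF, $crlf CRLF\n";
}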

Makeshifts last the longest.

Re^2: Check for "unix2dos" (CRLF) in binary files
by graff (Chancellor) on Sep 17, 2004 at 07:34 UTC
    I wouldn't have thought it was necessary to look for "LFCR" as well as "CRLF" -- it seems to me the "\x0d" always comes first in the pair, and I don't recall ever seeing it the other way around. (I wonder if/when we'll start seeing a utf-16 version of unix2dos... heaven help us.)

    Apart from that, using a fixed-length read certainly is a good idea, for cases when files are really big and 0x0A's happen to be really few and far between (or non-existent). And your use of "$_ = chop" to cover the buffer edges is a nice trick. Thanks!

      I think I have seen it that way around once or twice. It's not particularly costly to lump that in, since the regex engine will short-circuit to a very fast simple search for fixed strings, but I guess I'm just paranoid.

      The fixed record read is almost certainly a win, even with frequent LFs. In a 64k file full of LFs, the while loop will only iterate once, as opposed to 65536 times. All actual iteration is implicit in the regex engine, which is much faster. Now if there was a way to just ask for the number of matches without storing them anywhere, that would be even better. Maybe this does the trick:

      $lf   += s/(?<!\x0d)\x0a(?!\x0d)/x/g;
      $crlf += s/\x0a\x0d/xx/g + s/\x0d\x0a/xx/g;

      I'm hoping here that replacing with a same-length string will keep the engine from wasting too much effort shuffling the string guts in memory. If it's slower than the match+capture method, one could at least avoid the overhead of constantly setting up and tearing down anonymous arrays by changing the relevant statements to

      $lf   += @match = /(?<!\x0d)\x0a(?!\x0d)/g;
      $crlf += @match = ( /\x0a\x0d/g, /\x0d\x0a/g );

      and declaring @match just once at the top of the script.

      This would require some solid benchmarking on a bunch of diverse data to make any calls.
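      Before benchmarking, it may be worth checking that the variants agree on the counts. The following is only a rough sketch: the sample buffer and the sub names (count_capture, count_subst, count_loop) are made up for illustration, and the third variant, a scalar-context //g loop that bumps a counter, is an extra option not discussed above that avoids both the temporary list and the string rewrite.

```perl
use strict;
use warnings;

# Hypothetical sample buffer: two bare LFs, one CRLF pair, one LFCR pair.
my $sample = "a\x0ab\x0d\x0ac\x0a\x0dd\x0a";

# Variant 1: match in list context and count the elements of an anonymous array.
sub count_capture {
    my ($buf) = @_;
    my $lf    = @{ [ $buf =~ /(?<!\x0d)\x0a(?!\x0d)/g ] };
    my $pairs = @{ [ $buf =~ /\x0a\x0d/g, $buf =~ /\x0d\x0a/g ] };
    return ( $lf, $pairs );
}

# Variant 2: same-length substitution; s///g returns the number of replacements.
# Works on a copy here, because the substitution rewrites the buffer.
sub count_subst {
    my ($buf) = @_;
    my $lf    = ( $buf =~ s/(?<!\x0d)\x0a(?!\x0d)/x/g ) || 0;
    my $pairs = ( ( $buf =~ s/\x0a\x0d/xx/g ) || 0 )
              + ( ( $buf =~ s/\x0d\x0a/xx/g ) || 0 );
    return ( $lf, $pairs );
}

# Variant 3 (an assumption, not from the thread): scalar-context //g
# advances pos() on each match, so nothing is stored or rewritten.
sub count_loop {
    my ($buf) = @_;
    my ( $lf, $pairs ) = ( 0, 0 );
    $lf++    while $buf =~ /(?<!\x0d)\x0a(?!\x0d)/g;
    $pairs++ while $buf =~ /\x0a\x0d/g;
    $pairs++ while $buf =~ /\x0d\x0a/g;
    return ( $lf, $pairs );
}

printf "%-8s %d LF, %d pairs\n", 'capture', count_capture($sample);
printf "%-8s %d LF, %d pairs\n", 'subst',   count_subst($sample);
printf "%-8s %d LF, %d pairs\n", 'loop',    count_loop($sample);
```

      One caveat with the substitution variant: because it rewrites the buffer between the two pair searches, it can report fewer pairs than the other two when pairs overlap, for instance on the byte sequence CR LF CR. Any timing comparison would need to keep that difference in mind.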

      Makeshifts last the longest.

      IIRC, the old MacOS (pre 10) text files were based on LF/CR.

      No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1

        Actually, my recollection is that the pre-10 mac text files used just CR (no LF) -- this caused a lot of trouble for the unwary, because any unix or win/dos text process would slurp a whole mac text file as a single "line": there were no LF's to mark end-of-line.

        I sure don't miss the old macs.