in reply to Re^2: Check for "unix2dos" (CRLF) in binary files
in thread Check for "unix2dos" (CRLF) in binary files

I think I've seen it that way around once or twice. It's not particularly costly to lump that in, since the regex engine will short-circuit to a very fast simple search for fixed strings, but I guess I'm just paranoid.

The fixed record read is almost certainly a win, even with frequent LFs. In a 64k file full of LFs, the while loop will only iterate once, as opposed to 65536 times. All actual iteration is implicit in the regex engine, which is much faster. Now if there were a way to just ask for the number of matches without storing them anywhere, that would be even better. Maybe this does the trick:

$lf += s/(?<!\x0d)\x0a(?!\x0d)/x/g;
$ctlf += s/\x0a\x0d/xx/g + s/\x0d\x0a/xx/g;

I'm hoping here that replacing with a same-length string will keep the engine from wasting too much effort shuffling the string guts in memory. If it's slower than the match+capture method, one could at least avoid the overhead of constantly setting up and tearing down anonymous arrays by changing the relevant statements to

$lf += @match = /(?<!\x0d)\x0a(?!\x0d)/g;
$ctlf += @match = ( /\x0a\x0d/g, /\x0d\x0a/g );

and declaring @match just once at the top of the script.
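For what it's worth, here is a rough sketch of how the fixed record read and the reused @match array might fit together. The sub name count_line_endings, the $crlf counter name and the in-memory test handle are my own inventions for illustration, not anything from the original script. Note one limitation of reading in fixed records: a CR/LF pair that happens to straddle a record boundary will not be seen as a pair.

```perl
use strict;
use warnings;

my @match;    # declared once and reused for every count

# Count lone LFs and CR/LF (or LF/CR) pairs on a filehandle, reading
# fixed-size records instead of lines.
# Limitation of this sketch: a pair split across a record boundary is
# not recognised as a pair.
sub count_line_endings {
    my ($fh, $record_size) = @_;
    $record_size ||= 65536;
    local $/ = \$record_size;    # record reads of $record_size bytes

    my ($lf, $crlf) = (0, 0);
    while (defined(my $buf = <$fh>)) {
        # List assignment in scalar context yields the number of matches.
        $lf   += @match = $buf =~ /(?<!\x0d)\x0a(?!\x0d)/g;
        $crlf += @match = ( $buf =~ /\x0a\x0d/g, $buf =~ /\x0d\x0a/g );
    }
    return ($lf, $crlf);
}

# Usage, with an in-memory filehandle standing in for a real binary file.
my $sample = "a\x0ab\x0d\x0ac\x0a";
open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
binmode $fh;
my ($lf, $crlf) = count_line_endings($fh, 65536);
print "lone LF: $lf, CR/LF pairs: $crlf\n";
```

For a real file you would open it with the '<:raw' layer (or call binmode) so that no newline translation happens before the counting.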

This would require some solid benchmarking on a bunch of diverse data to make any calls.
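As a starting point, something along these lines with the core Benchmark module could compare the two approaches. This is only a sketch: the sub names and the synthetic buffer are made up for illustration, and a single artificial buffer says little about diverse real data, so treat any numbers it prints with suspicion.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test buffer: alternating lone LFs and CR/LF pairs.
my $data = ("some text\x0a" . "more text\x0d\x0a") x 3000;

my @match;    # shared array for the match-and-count variant

# Variant 1: count via same-length substitutions (works on a copy).
sub count_by_subst {
    my ($buf) = @_;
    my ($lf, $crlf) = (0, 0);
    $lf   += $buf =~ s/(?<!\x0d)\x0a(?!\x0d)/x/g;
    $crlf += ($buf =~ s/\x0a\x0d/xx/g) + ($buf =~ s/\x0d\x0a/xx/g);
    return ($lf, $crlf);
}

# Variant 2: count via list assignment into a reused array.
sub count_by_match {
    my ($buf) = @_;
    my ($lf, $crlf) = (0, 0);
    $lf   += @match = $buf =~ /(?<!\x0d)\x0a(?!\x0d)/g;
    $crlf += @match = ( $buf =~ /\x0a\x0d/g, $buf =~ /\x0d\x0a/g );
    return ($lf, $crlf);
}

# Run each variant for at least one CPU second and print a comparison.
cmpthese(-1, {
    subst => sub { my @r = count_by_subst($data) },
    match => sub { my @r = count_by_match($data) },
});
```

Both subs take a copy of the buffer, so the copying cost is the same on each side; the substitution variant needs the copy anyway because it modifies the string.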

Makeshifts last the longest.