
I think I saw it that way around once or twice. It's not particularly costly to lump that in, since the regex engine will short-circuit to a very fast simple search for fixed strings, but I guess I'm just paranoid.

The fixed record read is almost certainly a win, even with frequent LFs. In a 64k file full of LFs, the while loop will only iterate once, as opposed to 65536 times. All actual iteration is implicit in the regex engine, which is much faster. Now if there were a way to just ask for the number of matches without storing them anywhere, that would be even better. Maybe this does the trick, since s///g returns the number of substitutions it made:

$lf += s/(?<!\x0d)\x0a(?!\x0d)/x/g;
$ctlf += s/\x0a\x0d/xx/g + s/\x0d\x0a/xx/g;

I'm hoping here that replacing with a same-length string will keep the engine from wasting too much effort shuffling the string guts in memory. If it's slower than the match+capture method, one could at least avoid the overhead of constantly setting up and tearing down anonymous arrays by changing the relevant statements to

$lf += @match = /(?<!\x0d)\x0a(?!\x0d)/g;
$ctlf += @match = ( /\x0a\x0d/g, /\x0d\x0a/g );

and declaring @match just once at the top of the script.
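
To show how the pieces might fit together, here is a minimal sketch that combines the fixed record read with the match+capture counting. The sub name count_line_endings and the 64k record size are my own assumptions for illustration, not code from the thread. Note one limitation of this sketch: a CR/LF pair that straddles a record boundary is not seen as a pair, and its LF gets counted as a bare LF.

```perl
use strict;
use warnings;

# Hypothetical helper (name and record size are assumptions): count bare
# LFs and CR/LF pairs in a file, reading fixed-size records, not lines.
sub count_line_endings {
    my ($path) = @_;
    my ($lf, $ctlf) = (0, 0);
    my @match;                  # declared once, reused for every record

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/ = \65536;          # a reference to a number selects fixed records
    while (my $buf = <$fh>) {
        # A list assignment in scalar context yields the number of
        # elements on its right-hand side, i.e. the number of matches.
        $lf   += @match = $buf =~ /(?<!\x0d)\x0a(?!\x0d)/g;
        $ctlf += @match = ( $buf =~ /\x0a\x0d/g, $buf =~ /\x0d\x0a/g );
    }
    close $fh;
    return ($lf, $ctlf);
}

# Only runs when a filename is given on the command line.
if (@ARGV) {
    my ($lf, $ctlf) = count_line_endings($ARGV[0]);
    print "bare LF: $lf, CR/LF pairs: $ctlf\n";
}
```

Handling pairs split across two records would need some carry-over of the last byte of each record, which I have left out to keep the sketch short.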

This would require some solid benchmarking on a bunch of diverse data to make any calls.
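
A rough starting point for such a benchmark, using the core Benchmark module. The sample buffer is made up and far too uniform to draw conclusions from; the sub names are likewise mine. It only shows how the substitution variant and the match+capture variant could be compared on equal terms.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample buffer mixing bare LFs and CR/LF pairs (an assumption;
# real runs should use a variety of actual files).
my $data = join '', map { $_ % 3 ? "abc\x0a" : "abc\x0d\x0a" } 1 .. 13000;

my @match;    # shared array for the match+capture variant

# Substitution variant: s///g returns the number of replacements.
sub count_subst {
    my $buf = $data;    # s///g modifies its target, so work on a copy
    my $lf  = 0;
    $lf += $buf =~ s/(?<!\x0d)\x0a(?!\x0d)/x/g;
    return $lf;
}

# Match+capture variant: count the elements of the returned list.
sub count_match {
    my $buf = $data;    # copy here too, so both variants pay the same cost
    my $lf  = 0;
    $lf += @match = $buf =~ /(?<!\x0d)\x0a(?!\x0d)/g;
    return $lf;
}

# Run each variant for at least one CPU second and print a comparison.
cmpthese(-1, { subst => \&count_subst, match => \&count_match });
```

Both subs copy the buffer first so the cost of the copy does not skew the comparison in favour of the non-destructive match.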

Makeshifts last the longest.

In reply to Re^3: Check for "unix2dos" (CRLF) in binary files by Aristotle
in thread Check for "unix2dos" (CRLF) in binary files by graff
