
"Why do people always suppose you'll commonly get positive results near the start of the loop count?"

What I was supposing was that disk access is more expensive than anything else, so the best algorithm would be the one that minimizes it. I also supposed that if the files differ, the difference will nearly always appear before the end of the files, so stopping at the first mismatch will virtually always avoid some expensive reading.
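To make that concrete, here is a minimal sketch of the early-exit idea (a reconstruction, not the exact loop I posted earlier): read both files in fixed-size blocks and stop at the first mismatch. The files_differ name and the 64 KiB block size are arbitrary choices.

```perl
use strict;
use warnings;

# Compare two files block by block, returning as soon as a
# difference is found so no further disk reads are wasted.
sub files_differ {
    my ($path_a, $path_b) = @_;

    # Different sizes mean different files: no reading needed at all.
    return 1 if -s $path_a != -s $path_b;

    open my $fh_a, '<:raw', $path_a or die "open $path_a: $!";
    open my $fh_b, '<:raw', $path_b or die "open $path_b: $!";

    my $blocksize = 64 * 1024;
    while (1) {
        my $got_a = read $fh_a, my $buf_a, $blocksize;
        my $got_b = read $fh_b, my $buf_b, $blocksize;
        die "read error: $!" if !defined $got_a or !defined $got_b;

        return 1 if $buf_a ne $buf_b;   # first mismatch: stop here
        return 0 if $got_a == 0;        # both exhausted, files identical
    }
}
```

Fixed-size blocks also sidestep the cost of splitting input on line endings, which a line-by-line loop pays on every iteration.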

Well, I went off and did some testing, and it turns out that a quick Digest::MD5 of each file was about three times faster than reading line-by-line with the loop I posted earlier. That's for identical files, where both approaches have to read everything anyway.
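The MD5 comparison amounts to something like this (the md5_of helper is just illustrative); Digest::MD5's addfile reads the handle in chunks internally, so memory use stays flat even on huge files.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hash a whole file in one pass without slurping it into memory.
sub md5_of {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    return Digest::MD5->new->addfile($fh)->hexdigest;
}

my ($file_a, $file_b) = @ARGV;
print md5_of($file_a) eq md5_of($file_b)
    ? "files match\n"
    : "files differ\n";
```

Note that this always reads both files in full, match or no match, which is exactly why an early abort can still win when differences show up early.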

(I tested with a file of 10_000_000 lines, each with random alphanumeric data 10–1_000 characters long.)
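(For anyone who wants to reproduce that, the test file can be generated along these lines; this is a reconstruction, not the exact script I used, and testdata.txt is a made-up name.)

```perl
use strict;
use warnings;

# Write 10_000_000 lines of random alphanumeric data, each 10 to
# 1_000 characters long, roughly matching the test file described above.
my @chars = ('a' .. 'z', 'A' .. 'Z', '0' .. '9');

open my $out, '>', 'testdata.txt' or die "open: $!";
for (1 .. 10_000_000) {
    my $len = 10 + int rand 991;    # uniform over 10 .. 1_000
    print {$out} join('', map { $chars[int rand @chars] } 1 .. $len), "\n";
}
close $out or die "close: $!";
```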

Of course, if there really is an early give-away, spotting it and aborting all that remaining reading does give a big advantage. In this case, though, I wouldn't expect that to happen often enough to be worth it.
