But if the only differences are near the end, you'll waste a lot of time going through the file a line at a time.
Worse: if the files are equal, then you'll have to go through the whole file anyway.
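For reference, a minimal sketch of the kind of line-at-a-time loop under discussion (not the original poster's exact code; the filenames are placeholders):

    use strict;
    use warnings;

    # Line-at-a-time comparison with early exit on the first mismatch.
    sub files_differ_by_line {
        my ($file_a, $file_b) = @_;
        open my $fh_a, '<', $file_a or die "Can't open $file_a: $!";
        open my $fh_b, '<', $file_b or die "Can't open $file_b: $!";
        while (1) {
            my $line_a = <$fh_a>;
            my $line_b = <$fh_b>;
            return 0 if !defined $line_a && !defined $line_b;  # both ended: equal
            return 1 if !defined $line_a || !defined $line_b;  # one ended first
            return 1 if $line_a ne $line_b;                    # first difference
        }
    }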
Why do people always assume you'll get a positive result early in the loop?
Let's compromise: pick a random block somewhere "in the middle" of both files, read it from each, and compare. If the files differ, you'll likely see it immediately, especially if the typical differences are additions or deletions of whole lines rather than replacements of single characters.
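A rough sketch of that spot check; the 4 KB block size and the up-front size comparison are my own assumptions, not part of the suggestion. Note that a mismatching block proves the files differ, while a matching block proves nothing on its own:

    use strict;
    use warnings;

    # Spot check: compare one randomly chosen block from "the middle".
    sub random_blocks_differ {
        my ($file_a, $file_b, $block_size) = @_;
        $block_size ||= 4096;
        my ($size_a, $size_b) = (-s $file_a, -s $file_b);
        return 1 if $size_a != $size_b;          # sizes differ: files differ
        my $max_offset = $size_a - $block_size;
        $max_offset = 0 if $max_offset < 0;      # small file: read from start
        my $offset = int rand($max_offset + 1);
        open my $fh_a, '<:raw', $file_a or die "Can't open $file_a: $!";
        open my $fh_b, '<:raw', $file_b or die "Can't open $file_b: $!";
        seek $fh_a, $offset, 0;
        seek $fh_b, $offset, 0;
        read $fh_a, my $buf_a, $block_size;
        read $fh_b, my $buf_b, $block_size;
        return $buf_a ne $buf_b;
    }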
"Why do people always assume you'll get a positive result early in the loop?"
What I was supposing was that disk access is more expensive than anything else, so the best algorithm is the one that minimizes it. I also supposed that if the files differ, the difference will nearly always appear before the end of the files, so stopping at the first mismatch will almost always avoid some expensive reading.
Well, I went off and did some testing, and it turns out that a quick Digest::MD5 of each file was about three times faster than reading line-by-line with the loop I posted earlier; that's for identical files.
(I tested with a file of 10_000_000 lines, each with random alphanumeric data 10–1_000 characters long.)
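A minimal sketch of that digest comparison, assuming Digest::MD5's addfile interface; addfile reads the whole file in large chunks rather than line by line, which is plausibly where the speedup comes from:

    use strict;
    use warnings;
    use Digest::MD5;

    # Digest each file whole and compare the hex digests.
    sub md5_of_file {
        my ($file) = @_;
        open my $fh, '<:raw', $file or die "Can't open $file: $!";
        return Digest::MD5->new->addfile($fh)->hexdigest;
    }

    my ($file_a, $file_b) = @ARGV;
    print md5_of_file($file_a) eq md5_of_file($file_b)
        ? "files match\n"
        : "files differ\n";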
Of course, if there is an early give-away, spotting it and aborting the rest of the reading is a big win. In this case, though, I wouldn't expect that to happen often enough to be worth it.