I have two very large text files (about 300MB each), and I want to find the set of lines (defined as text between \n characters) that appear in one file but not the other. This would be easy if there weren't so much data. I can't brute-force this either: each file contains about 1.4 million lines. On the other hand, the files are somewhat ordered. I'm not sure how large the disparity is, but for any given line in fileA, if it's in fileB it will appear within N lines of the previous match. I suspect that maintaining some kind of lookahead cache will be the trick, but I'm not sure how to do it.
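
To make the lookahead cache idea concrete, here is a rough Python sketch of one way it might work. It is only an illustration under assumptions that are not stated above: the function name `lines_only_in_a` and the parameter `n` are hypothetical, `n` must be chosen by the caller, and a match is searched for within `n` lines on either side of the furthest match found so far in the other file. If the real disparity is larger than `n`, matching lines will be reported as missing.

```python
from collections import deque


def lines_only_in_a(a_lines, b_lines, n):
    """Yield lines from a_lines that have no match in a sliding window of b_lines.

    Assumption: a matching line in b_lines lies within n lines of the
    furthest match found so far. Only about 2 * n lines of b_lines are
    held in memory at any time.
    """
    b_iter = iter(b_lines)
    window = deque()  # (index, line) pairs from b_lines near the last match
    last = 0          # index in b_lines of the furthest match so far
    next_idx = 0      # index of the next unread line of b_lines

    for a in a_lines:
        # Read ahead until the window reaches n lines past the last match.
        while next_idx <= last + n:
            try:
                line = next(b_iter)
            except StopIteration:
                break
            window.append((next_idx, line))
            next_idx += 1

        # Forget lines more than n behind the last match.
        while window and window[0][0] < last - n:
            window.popleft()

        # Linear scan of the window; fine for small n.
        for idx, line in window:
            if line == a:
                last = max(last, idx)
                break
        else:
            yield a
```

Calling it twice with the arguments swapped gives both halves of the difference. With real files, the two open file objects can be passed directly, since iterating over a file yields one line at a time.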