in reply to Re: Comparing Large Files
in thread Comparing Large Files

The problem with the core diff tools is that they assume the two files to be in perfect order... which they aren't. You could almost think of the problem as being a comparison of two sets of lines, where I want to know which lines aren't in both sets (ignoring order). It would be a nasty problem if there were truly no order to the lines, but thankfully they are mostly in the same order, with some chunks out of order by a couple hundred lines (which is nothing given the size of the files... around 1.4 million lines).

Re: Re: Re: Comparing Large Files
by hossman (Prior) on May 17, 2003 at 01:47 UTC

    so just put them in the same order...

    PROMPT% cat file1
    cat
    dog
    bird
    parrot
    perl
    yakko
    wakko
    PROMPT% cat file2
    cat
    bird
    perl
    parrot
    yakko
    dot
    PROMPT% sort file1 > file1.sort
    PROMPT% sort file2 > file2.sort
    PROMPT% comm -23 file1.sort file2.sort > only.in.file1
    PROMPT% comm -13 file1.sort file2.sort > only.in.file2
    PROMPT% cat only.in.file1
    dog
    wakko
    PROMPT% cat only.in.file2
    dot

    I'll admit, sort isn't the fastest thing in the world for large files, but if you only need to run this once, then coming up with a complicated algorithm (which may or may not have bugs depending on your assumptions about this "N" you mentioned) probably isn't worth it.
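
For anyone wondering why sorting first makes the comparison cheap: once both inputs are in the same order, a single merge-style pass finds the lines unique to each side while holding only one line per file in memory. Below is a rough Python sketch of that idea. It is not how comm is actually implemented, and the name `comm_unique` is made up for illustration. It assumes both inputs were sorted with the same ordering (here Python's `sorted()`, which may differ from what sort(1) gives under your locale).

```python
def comm_unique(sorted_a, sorted_b):
    """Yield (line, source) for lines found in only one of two sorted inputs.

    source is 1 for the first input and 2 for the second, loosely mirroring
    columns 1 and 2 of comm. Both inputs must be sorted the same way.
    """
    it_a, it_b = iter(sorted_a), iter(sorted_b)
    a = next(it_a, None)
    b = next(it_b, None)
    while a is not None and b is not None:
        if a == b:
            # Present in both: skip it on both sides.
            a = next(it_a, None)
            b = next(it_b, None)
        elif a < b:
            # a sorts before anything left in b, so b cannot contain it.
            yield a, 1
            a = next(it_a, None)
        else:
            yield b, 2
            b = next(it_b, None)
    # Whatever remains on one side has no partner on the other.
    while a is not None:
        yield a, 1
        a = next(it_a, None)
    while b is not None:
        yield b, 2
        b = next(it_b, None)


if __name__ == "__main__":
    file1 = ["cat", "dog", "bird", "parrot", "perl", "yakko", "wakko"]
    file2 = ["cat", "bird", "perl", "parrot", "yakko", "dot"]
    for line, source in comm_unique(sorted(file1), sorted(file2)):
        print(source, line)
```

With real files you would pass the two open file handles of the already-sorted files instead of lists, since the function only needs to iterate over lines.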

Re: Re: Re: Comparing Large Files
by Anonymous Monk on May 16, 2003 at 18:24 UTC
    An ugly solution I can think of is breaking up the files into smaller overlapping sub-parts that will fit into memory. Read the matching sections of each file into arrays and then pick out the lines that are in one array but not the other. Note the lines that are missing in the last thousand lines of a section, to make sure they don't show up in the next sub-part.
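
One way to read the suggestion above, sketched in Python: instead of literally cutting overlapping chunks, walk both files in lockstep and carry unmatched lines forward until they are older than some window. This is a streaming variant of the idea, not necessarily what the poster had in mind. The function name `windowed_diff` and the `window` parameter are invented for illustration, and the result is only correct under the assumption that a line and its partner in the other file are never more than `window` positions apart.

```python
from collections import Counter, deque
from itertools import zip_longest


def _dec(counter, key):
    """Decrement a count and drop the key at zero so memory stays bounded."""
    counter[key] -= 1
    if counter[key] == 0:
        del counter[key]


def windowed_diff(lines_a, lines_b, window=1000):
    """Return (only_in_a, only_in_b) for two mostly-aligned line sequences.

    Assumption: matching lines are displaced by at most `window` positions.
    Lines further apart than that are reported as unique on both sides.
    """
    pending = (Counter(), Counter())    # unmatched lines still inside the window
    cancelled = (Counter(), Counter())  # queued entries that were matched later
    queue = (deque(), deque())          # (position, line) in arrival order
    only = ([], [])

    def feed(side, pos, line):
        other = 1 - side
        if pending[other][line] > 0:
            # Partner was waiting on the other side: cancel its oldest entry.
            _dec(pending[other], line)
            cancelled[other][line] += 1
        else:
            pending[side][line] += 1
            queue[side].append((pos, line))

    def expire(side, pos):
        # Entries older than the window can no longer find a partner.
        while queue[side] and queue[side][0][0] < pos - window:
            _, line = queue[side].popleft()
            if cancelled[side][line] > 0:
                _dec(cancelled[side], line)   # already matched, just discard
            else:
                _dec(pending[side], line)
                only[side].append(line)

    for pos, (a, b) in enumerate(zip_longest(lines_a, lines_b)):
        if a is not None:
            feed(0, pos, a)
        if b is not None:
            feed(1, pos, b)
        expire(0, pos)
        expire(1, pos)

    # End of input: everything still queued is either cancelled or unique.
    for side in (0, 1):
        expire(side, float("inf"))
    return only


if __name__ == "__main__":
    file1 = ["cat", "dog", "bird", "parrot", "perl", "yakko", "wakko"]
    file2 = ["cat", "bird", "perl", "parrot", "yakko", "dot"]
    print(windowed_diff(file1, file2, window=3))
```

Memory use is proportional to the window size rather than the file size, which is the appeal here. The catch is the same as with any "N"-based scheme: if a chunk has moved further than the window, its lines get reported as missing from both files, so the window should be chosen comfortably larger than the worst displacement you expect.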