in reply to Comparing Large Files

I had to do this last year myself. I was working on a Solaris box, but the Unix compare commands were too slow, i.e. they would run overnight with no result. :( I found that the Win2000 file comparison command "fc" works very fast (1-15 minutes depending on the data size). This was on a PIII 660MHz w/ 256MB RAM.

Dogz

Re: Re: Comparing Large Files
by Anonymous Monk on May 16, 2003 at 18:14 UTC
    The problem with the core diff tools is that they assume the two files are in perfect order... which they aren't. You could almost think of the problem as a comparison of two sets of lines, where I want to know which lines aren't in both sets (ignoring order). It would be a nasty problem if there were truly no order to the lines, but thankfully they are mostly in the same order, with some chunks out of order by a couple hundred lines (which is nothing given the size of the files... around 1.4 million lines).
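    If all the distinct lines fit in memory as hash keys, you can treat it as exactly that set problem. Here's a minimal Perl sketch (the file names are hypothetical, and with 1.4 million lines the hash may or may not fit in 256MB, so treat it as a starting point, not a definitive solution):

        #!/usr/bin/perl
        # Minimal sketch of the "two sets of lines" view.  File names
        # are hypothetical; assumes every distinct line fits in memory
        # as a hash key.  Counts are kept so duplicates cancel out as
        # a multiset; each differing line is printed once.
        use strict;
        use warnings;

        my %count;   # line => (times seen in file1) - (times seen in file2)

        open my $fh1, '<', 'file1' or die "file1: $!";
        $count{$_}++ while <$fh1>;   # bare readline in while assigns to $_

        open my $fh2, '<', 'file2' or die "file2: $!";
        $count{$_}-- while <$fh2>;

        for my $line (sort keys %count) {
            print "only in file1: $line" if $count{$line} > 0;
            print "only in file2: $line" if $count{$line} < 0;
        }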

      so just put them in the same order...

        PROMPT% cat file1
        cat
        dog
        bird
        parrot
        perl
        yakko
        wakko
        PROMPT% cat file2
        cat
        bird
        perl
        parrot
        yakko
        dot
        PROMPT% sort file1 > file1.sort
        PROMPT% sort file2 > file2.sort
        PROMPT% comm -23 file1.sort file2.sort > only.in.file1
        PROMPT% comm -13 file1.sort file2.sort > only.in.file2
        PROMPT% cat only.in.file1
        dog
        wakko
        PROMPT% cat only.in.file2
        dot
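      (For reference: comm's three output columns are lines only in the first file, lines only in the second, and lines common to both; -23 suppresses the second and third columns, and -13 suppresses the first and third.)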

      I'll admit, sort isn't the fastest thing in the world for large files, but if you only need to run this once, then coming up with a complicated algorithm (which may or may not have bugs, depending on your assumptions about this "N" you mentioned) probably isn't worth it.

      An ugly solution I can think of is breaking the files up into smaller, overlapping sub-parts that will fit into memory. Read each section of the two files into an array, then pick out the lines that are in one but not the other. Carry the unmatched lines from the tail of each sub-part (say, the last thousand lines) over into the next one, so a line whose match falls just across the boundary doesn't get reported as missing.
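      Here's a rough Perl sketch of that idea (the file names and chunk size are assumptions; the chunk size just has to comfortably exceed how far out of order the files get). Rather than a fixed thousand-line overlap, it carries every still-unmatched line forward, so memory use grows with the running difference rather than the file size:

        #!/usr/bin/perl
        # Sketch of the chunked comparison above.  Reads both files a
        # slice at a time, cancels lines that appear in both running
        # bags, and carries unmatched lines forward so a line can
        # still match one in a later slice.
        use strict;
        use warnings;

        my $CHUNK = 100_000;    # lines per slice (assumption)

        open my $fh1, '<', 'file1' or die "file1: $!";
        open my $fh2, '<', 'file2' or die "file2: $!";

        my (%only1, %only2);    # line => unmatched count so far

        while (1) {
            my @a = read_lines($fh1, $CHUNK);
            my @b = read_lines($fh2, $CHUNK);
            last unless @a or @b;
            $only1{$_}++ for @a;
            $only2{$_}++ for @b;
            for my $line (keys %only1) {    # cancel lines seen in both
                next unless exists $only2{$line};
                my $m = $only1{$line} < $only2{$line}
                      ? $only1{$line} : $only2{$line};
                delete $only1{$line} unless ($only1{$line} -= $m);
                delete $only2{$line} unless ($only2{$line} -= $m);
            }
        }

        print "only in file1: $_" for sort keys %only1;
        print "only in file2: $_" for sort keys %only2;

        sub read_lines {    # grab up to $n lines from a filehandle
            my ($fh, $n) = @_;
            my @lines;
            while ($n-- > 0 and defined(my $l = <$fh>)) { push @lines, $l }
            return @lines;
        }

      Since the cancellation is order-independent, what's left at the end is the full multiset difference; the chunking only matters for keeping the hashes small while the files mostly agree.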