in reply to Compare large files

I recall making the suggestion in your previous thread that got you started down this path, and part of that suggestion was to make sure these output files are created in sorted order, so that you would not have to sort them later and comparing two files would be much easier.

But based on the data sample you showed in one of your replies here, it looks like the files are not sorted. So the problem you need to fix is in the program that produces these files -- they should be written in sorted order.
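
As a rough illustration (a sketch, not your actual script -- the %entry hash and the file name scan.tab are made up), keeping the output sorted can be as simple as sorting the keys at print time:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # %entry stands in for whatever per-file data the directory scan
    # collects (here, path => size). Sorting the keys when printing keeps
    # the tabulation file in order, so no separate sort pass is needed.
    my %entry = (
        '/data/b.txt' => 120,
        '/data/a.txt' => 512,
    );

    open my $out, '>', 'scan.tab' or die "scan.tab: $!";
    print {$out} "$_\t$entry{$_}\n" for sort keys %entry;
    close $out or die "scan.tab: $!";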

Then you can use the standard "diff" utility, which will correctly show which lines have been added, removed, or changed between the two files.

And "diff" already knows how to manage big files -- it might take a while, but I'm pretty sure it will finish.

Also, it might help if you consider breaking your outputs into smaller pieces. How hard/bad would it be to have your directory scan process create 10 files of 100 MB each on average (or 100 files of 10 MB each on average)? I think the directory structure should provide a sensible way to do that...

Update: In fact, it might be worthwhile to simply create one tabulation file per directory -- I believe you start with a list of the directories being scanned, so the task becomes: create and compare table files for each directory in the list. That should be pretty simple to maintain, and will run as quickly as any other approach.
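
Something along these lines, purely as a sketch -- the dirs.list file name, the old/ and new/ trees, and the .tab naming are all assumptions about your layout, not the real thing:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # For each scanned directory, compare its old and new tabulation files.
    open my $dirs, '<', 'dirs.list' or die "dirs.list: $!";
    while (my $dir = <$dirs>) {
        chomp $dir;
        (my $tag = $dir) =~ s{/}{_}g;          # flatten the path into a file name
        my ($old, $new) = ("old/$tag.tab", "new/$tag.tab");
        # each per-directory pair is small, so an ordinary diff is cheap
        system('diff', '-q', $old, $new) == 0
            or print "differences in $dir\n";
    }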

One last point, again based on the data sample you posted above: are you sure that all the differences are equally important and relevant? If so, then using diff is fine. If not, either adjust the script that creates these files so that unimportant differences never end up in the data, or else write your own customized perl variant of diff (or, better yet, a filter on the output from diff) to exclude the unimportant differences.
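
A filter like that can stay very small. A rough sketch follows; the "mtime" pattern is just a stand-in for whatever you decide counts as an unimportant difference in your data:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Assumed usage:  diff old.tab new.tab | perl filter_diff.pl
    while (my $line = <STDIN>) {
        next unless $line =~ /^[<>]/;      # keep only the lines diff flagged
        next if $line =~ /\bmtime=/;       # drop differences you don't care about
        print $line;
    }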

Replies are listed 'Best First'.
Re^2: Compare large files
by boardryder (Novice) on Jul 10, 2009 at 00:48 UTC
    I did try to create one file per directory based on your excellent example provided on my other thread. After completing half of my directory scan, it had created nearly 500,000 files and took nearly 30 minutes just to do a listing, so I started back here again.

    It looks like I have several ideas to implement now and my options are clear. I'm going to attempt sorting the two large files and then use comm -3 to filter the diffs, as that seems the most straightforward way to at least get this working.

    Thanks All.