I recall making the suggestion in your previous thread that got you started down this path. Part of that suggestion was to make sure the output files themselves are created in sorted order, so that you would not have to sort them later and comparing two files would be much easier.

But based on the data sample you showed in one of your replies here, it looks like the files are not sorted. So the problem you need to fix is in the program that produces these files -- they should be written in sorted order.
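To make "written in sorted order" concrete, here is a minimal sketch. I don't know the real record layout, so the tab-separated "path, size, mtime" line and the name write_sorted are assumptions made up for illustration. The only point is that the producer sorts before it prints.

```perl
use strict;
use warnings;

# Hypothetical record layout: one "path<TAB>size<TAB>mtime" string per file.
# The records are sorted as plain strings before being written, so two runs
# over the same tree always list their entries in the same order.
sub write_sorted {
    my ($out_file, @records) = @_;
    open my $out, '>', $out_file or die "Cannot write $out_file: $!";
    print {$out} "$_\n" for sort @records;
    close $out or die "Cannot close $out_file: $!";
    return;
}
```

This holds every record in memory before writing. If the listings are too big for that, writing them unsorted and running the result through the system "sort" utility once should give the same effect.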

Then you can use the standard "diff" utility, which will correctly show which lines appear in only one of the two files and which lines have changed.

And "diff" already knows how to manage big files -- it might take a while, but I'm pretty sure it will finish.

Also, it might help if you consider breaking your outputs into smaller pieces. How hard/bad would it be to have your directory scan process create 10 files of 100 MB each on average (or 100 files of 10 MB each on average)? I think the directory structure should provide a sensible way to do that...

(Update: In fact, it might be worthwhile to simply create one tabulation file per directory. I believe you start with a list of the directories being scanned, so the task becomes: create and compare table files for each directory in the list. That should be pretty simple to maintain, and will run as quickly as any other approach.)
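A rough sketch of the per-directory idea follows. The naming scheme (table_name), the ".tab" suffix, and the two root directories holding the old and new tables are all assumptions of mine, not anything from your setup. It uses the core File::Compare module to find which directories changed at all, so that "diff" only needs to be run on those.

```perl
use strict;
use warnings;
use File::Compare;

# Hypothetical naming scheme: flatten a directory path into a file name,
# so "/data/projects/a" becomes "data_projects_a.tab".
sub table_name {
    my ($dir) = @_;
    (my $name = $dir) =~ s{^/+|/+$}{}g;   # drop leading and trailing slashes
    $name =~ s{/+}{_}g;                   # flatten the remaining ones
    return "$name.tab";
}

# Return the directories whose old and new tables differ.
# Assumes both tables exist for every directory in the list.
sub changed_dirs {
    my ($old_root, $new_root, @dirs) = @_;
    my @changed;
    for my $dir (@dirs) {
        my $table  = table_name($dir);
        # compare() returns 0 if equal, 1 if different, -1 on error
        my $status = compare("$old_root/$table", "$new_root/$table");
        die "Cannot compare tables for $dir: $!\n" if $status < 0;
        push @changed, $dir if $status == 1;
    }
    return @changed;
}
```

Note that this flattening maps "a/b_c" and "a_b/c" to the same file name, so a real version would need a scheme that cannot collide.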

One last point, again based on the data sample you posted above. Are you sure that all differences are equally important and relevant? If yes, then using diff is fine. If not, either adjust the script that creates these files, to avoid cases where unimportant differences are present in the data, or else you'll have to write your own customized perl variant of diff (or better yet, a filter on the output from diff) to exclude unimportant differences.
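As an illustration of the "filter on the output from diff" option, here is a minimal sketch. It assumes the default diff output format, and the rule for what counts as unimportant (here, any entry matching a pattern the caller supplies) is a placeholder, since only you know which differences matter. The name important_changes is made up.

```perl
use strict;
use warnings;

# Keep only the "< ..." and "> ..." lines of default-format diff output,
# dropping hunk headers such as "3c3", the "---" separators, and any
# entry that matches the caller's "unimportant" pattern.
sub important_changes {
    my ($ignore_re, @diff_lines) = @_;
    return grep { /^[<>] / && $_ !~ $ignore_re } @diff_lines;
}

# Usage sketch (file names and pattern are placeholders):
#   open my $diff, '-|', 'diff', 'old.tab', 'new.tab'
#       or die "Cannot run diff: $!";
#   print important_changes(qr/\.tmp\t/, <$diff>);
```

A per-line pattern like this can only ignore whole entries. Ignoring a change in one field while still reporting changes in the others is easier to handle in the script that writes the files, by leaving that field out.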


In reply to Re: Compare large files by graff
in thread Compare large files by boardryder
