I can't answer your question because you haven't given enough details. But I do a lot of file comparisons at $work, most of the time with very large files, and various strategies make it possible to avoid loading all of them into memory. A lot depends on the details, though. For example, are you looking for what we call "orphans", i.e. records that are in file 1 and not in file 2, or the other way around? Or are you rather looking for differences between records that have the same identifying key? Or both? Are you looking for common records, or for differences? The answers to these questions may lead to entirely different strategies.

Sometimes, you can load just one file into memory and then scan the other files one by one, each of them line by line, without ever loading any of those other files entirely into memory. Then, as a second step, you compare the generated files containing the differences between each of the other files and file 1; these may (or may not) be much smaller than the original files, depending on the shape of your data.
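To make the first strategy a bit more concrete, here is a minimal sketch in plain Perl, with no modules. It is only an illustration under assumptions of mine, since I don't know your data: I assume one record per line, in the form "key;rest of the record", with unique keys, and the sub name and the ONLY_IN_REF / ONLY_IN_OTHER / DIFF output tags are made up for the example.

```perl
use strict;
use warnings;

# Assumed (hypothetical) record layout: "key;rest of the record", one per line.
sub compare_against_reference {
    my ($ref_file, $other_file, $out_file) = @_;

    # Load only the reference file (file 1) into memory: key => rest of record.
    my %ref;
    open my $ref_fh, '<', $ref_file or die "Cannot open $ref_file: $!";
    while (my $line = <$ref_fh>) {
        chomp $line;
        next unless length $line;
        my ($key, $value) = split /;/, $line, 2;
        $ref{$key} = $value // '';
    }
    close $ref_fh;

    # Stream the other file one line at a time; it is never held in memory.
    open my $in,  '<', $other_file or die "Cannot open $other_file: $!";
    open my $out, '>', $out_file   or die "Cannot open $out_file: $!";
    my %seen;
    while (my $line = <$in>) {
        chomp $line;
        next unless length $line;
        my ($key, $value) = split /;/, $line, 2;
        $value //= '';
        if (!exists $ref{$key}) {
            # Orphan: present in the other file, absent from the reference.
            print {$out} "ONLY_IN_OTHER;$key;$value\n";
        }
        else {
            $seen{$key} = 1;
            # Same key, different content.
            print {$out} "DIFF;$key;$ref{$key};$value\n" if $ref{$key} ne $value;
        }
    }
    close $in;

    # Orphans the other way around: reference keys never met in the other file.
    for my $key (sort grep { !$seen{$_} } keys %ref) {
        print {$out} "ONLY_IN_REF;$key;$ref{$key}\n";
    }
    close $out or die "Cannot close $out_file: $!";
    return;
}
```

You would call it once per file to be compared, e.g. compare_against_reference('file1.txt', 'file2.txt', 'diff_1_2.txt'), and then work on the resulting difference files. Only the keys and records of file 1 (plus the set of keys seen) live in memory, so this works as long as file 1 itself fits.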

Another approach (especially if the files are truly huge) is to sort the files according to the comparison key prior to the comparison and then read all of your files line by line in parallel. Sorting the files beforehand has a cost, but it is often worth it, because the multifile comparison is then much faster. And, depending on where your files are coming from, some of them may already be sorted.
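As an illustration of the sorted approach, here is a rough sketch for two files, again in plain Perl without modules. The assumptions are mine, not something I know about your data: one record per line in the form "key;rest of the record", unique keys, and both files already sorted on the key in plain string order (the order Perl's cmp uses, which is what something like LC_ALL=C sort would typically give you). The sub names and output tags are made up. Extending it to more than two files follows the same idea but needs more bookkeeping.

```perl
use strict;
use warnings;

# Return the next (key, value) pair from a filehandle, or an empty list at EOF.
# Assumed (hypothetical) record layout: "key;rest of the record".
sub next_record {
    my ($fh) = @_;
    while (defined(my $line = <$fh>)) {
        chomp $line;
        next unless length $line;
        my ($key, $value) = split /;/, $line, 2;
        return ($key, $value // '');
    }
    return;
}

# Both input files must already be sorted on the key (string order).
sub compare_sorted {
    my ($file_a, $file_b, $out_file) = @_;
    open my $fa,  '<', $file_a   or die "Cannot open $file_a: $!";
    open my $fb,  '<', $file_b   or die "Cannot open $file_b: $!";
    open my $out, '>', $out_file or die "Cannot open $out_file: $!";

    # Only one record per file is held in memory at any time.
    my ($ka, $va) = next_record($fa);
    my ($kb, $vb) = next_record($fb);

    while (defined($ka) or defined($kb)) {
        # An exhausted file is treated as if its next key sorted after everything.
        my $order = !defined($kb) ? -1
                  : !defined($ka) ?  1
                  :                  $ka cmp $kb;
        if ($order < 0) {
            print {$out} "ONLY_IN_A;$ka;$va\n";
            ($ka, $va) = next_record($fa);
        }
        elsif ($order > 0) {
            print {$out} "ONLY_IN_B;$kb;$vb\n";
            ($kb, $vb) = next_record($fb);
        }
        else {
            # Same key in both files: report only if the content differs.
            print {$out} "DIFF;$ka;$va;$vb\n" if $va ne $vb;
            ($ka, $va) = next_record($fa);
            ($kb, $vb) = next_record($fb);
        }
    }
    close $fa;
    close $fb;
    close $out or die "Cannot close $out_file: $!";
    return;
}
```

The point of this design is that memory use no longer depends on file size at all; the price is that the result is only meaningful if the sort order of the files really matches the comparison used in the code.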

Each case is different, so there is no general strategy that can be applied blindly to your specific problem. This is why I can't suggest a solution without knowing in detail what you're really comparing and what kind of differences (or common records) you're looking for.


In reply to Re^3: write to Disk instead of RAM without using modules by Laurent_R
in thread write to Disk instead of RAM without using modules by Anonymous Monk
