You don't explain what you mean by "unique structure", or why you need to load the file data into hashes in order to compare them, so I'm going to take it as read that you do need to do that. But, as pointed out above, loading huge files into hashes will require vastly more memory than the files occupy on disk. For a flat hash, multiplying the file size by 5 won't be far wrong. A more complicated nested structure will probably need a higher multiplier.
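As a back-of-envelope check, that rule of thumb can be applied before loading anything. This is only a sketch: estimate_hash_bytes is a made-up name, and the default factor of 5 is the rough flat-hash figure from above, not a measurement of your data.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the memory a flat hash built from a file
# might need, as on-disk size times a multiplier.
# The default of 5 is a rule of thumb; nested structures will need more.
sub estimate_hash_bytes {
    my ( $path, $factor ) = @_;
    $factor = 5 unless defined $factor;
    my $size = -s $path;
    die "Cannot stat '$path': $!" unless defined $size;
    return $size * $factor;
}
```

If the estimate for both files together comes anywhere near your physical RAM, the rest of the design matters far more than threading does.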

You are unlikely to see any performance benefit from reading two files in parallel, unless they exist on different drives. Performance will be limited by the seek performance and throughput of the drive, and accessing two huge files in parallel on the same drive will exacerbate the problems.

Think of it like trying to read two different chapters in a book in parallel. The read head (your eyes) will be constantly flicking back and forth between the front and the back of the book.

If you can arrange for them to be on separate drives, then there will probably be some gains to be had from reading in parallel (controllers and other factors can also have an influence). But building the hashes in different threads and then comparing them is a bad idea. The internal and user-level locking required to prevent corruption will severely impact performance.

Assuming different drives and the necessity to build hashes, you would be far better off reading the files line by line on separate threads and conveying the lines to a third thread that performs the hash building and comparisons.
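A minimal sketch of that arrangement, using threads and Thread::Queue, might look like the following. It needs a perl built with thread support. Since I don't know your real structure, it assumes the comparison is simply "which whole lines appear in only one file"; the names reader and compare_files are invented for illustration, and the calling thread plays the part of the third thread.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Reader thread: queue "tag<TAB>line" strings, then undef as an end marker.
sub reader {
    my ( $tag, $path, $Q ) = @_;
    if ( open my $fh, '<', $path ) {
        while ( my $line = <$fh> ) {
            chomp $line;
            $Q->enqueue( "$tag\t$line" );
        }
        close $fh;
    }
    else {
        warn "Cannot open '$path': $!";
    }
    $Q->enqueue( undef );    # always signal completion, even on failure
    return;
}

# Hypothetical comparison: the hash lives in one thread only, so it
# needs no sharing and no locking.
sub compare_files {
    my ( $pathA, $pathB ) = @_;
    my $Q       = Thread::Queue->new;
    my @readers = (
        threads->create( \&reader, 1, $pathA, $Q ),
        threads->create( \&reader, 2, $pathB, $Q ),
    );

    my %seen;                # line => bitmask: 1 = file A, 2 = file B
    my $finished = 0;
    while ( $finished < @readers ) {
        my $item = $Q->dequeue;
        if ( !defined $item ) { ++$finished; next }
        my ( $tag, $line ) = split /\t/, $item, 2;
        $seen{ $line } = ( $seen{ $line } // 0 ) | ( 0 + $tag );
    }
    $_->join for @readers;

    my %diff = ( only_a => [], only_b => [] );
    for my $line ( sort keys %seen ) {
        push @{ $diff{only_a} }, $line if $seen{ $line } == 1;
        push @{ $diff{only_b} }, $line if $seen{ $line } == 2;
    }
    return \%diff;
}
```

Only the queue is shared between threads here. Note that the queue is unbounded in this sketch, so if the readers outrun the consumer, the queued lines add to the memory problem rather than relieve it.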

That said, unless you can perform partial comparisons on the fly and conserve memory by discarding parts of the structures built as you finish with them, then you are likely to be constrained by memory.

And if you can discard chunks of memory before you have read the files completely, then one wonders why you need to build the structures in the first place. Wouldn't a line by line comparison be possible?
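For instance, if the two files happen to hold corresponding records in the same order (an assumption; you haven't said), a lockstep read needs almost no memory at all. A sketch, with diff_lines being an invented name:

```perl
use strict;
use warnings;

# Hypothetical helper: read two files in lockstep and return the line
# numbers at which they differ. Only one line of each file is held in
# memory at a time. A shorter file counts as differing on every extra line.
sub diff_lines {
    my ( $pathA, $pathB ) = @_;
    open my $fhA, '<', $pathA or die "Cannot open '$pathA': $!";
    open my $fhB, '<', $pathB or die "Cannot open '$pathB': $!";

    my @diffs;
    my $n = 0;
    while ( 1 ) {
        my $lineA = <$fhA>;
        my $lineB = <$fhB>;
        last unless defined $lineA or defined $lineB;    # both exhausted
        ++$n;
        push @diffs, $n
            unless defined $lineA and defined $lineB and $lineA eq $lineB;
    }
    close $fhA;
    close $fhB;
    return @diffs;
}
```

If the records are not in the same order, sorting both files externally first and then walking them like this may still be cheaper than holding either in a hash.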

The bottom line is that whether there is any benefit in parallelising your program depends entirely upon the nature of your data, and the circumstances of your hardware setup, and you have not described either in sufficient detail to allow anyone to give you good advice.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

In reply to Re^3: changing parameters in a thread by BrowserUk
in thread changing parameters in a thread by Anonymous Monk
