If I understand correctly, you have a large text which you break into 100 MB chunks and then calculate the checksum of each chunk independently, i.e. you are not trying to create one checksum for the whole 127 GB text.

You can save some time by setting up a pipeline: it starts by reading the first chunk. When that is done, the checksum calculation begins in another thread while the current thread reads the second chunk in parallel. The savings depend on the ratio of the time spent reading from disk to the time spent calculating the checksum. The big objection here is that shared memory between threads in Perl is not efficient (perhaps someone can teach me otherwise), so you may end up wasting more time on locking or on duplicating data between the threads. The alternative is for the reader thread to pass data via a pipe to the calculating thread.

Also, it is worth investigating Digest::SHA's ability to add data from a stream as it becomes available over the pipe and whether some calculations can be done before the full chunk becomes available. I do not know about this.
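Here is a minimal sketch of the two ideas above, assuming a Unix-like system where fork and pipe are available and using a child process instead of a thread. The parent only reads from disk and writes into a pipe; the child passes whatever arrives to Digest::SHA's add method, so hashing starts before a full chunk has been transferred. The name pipelined_digests and the 1 MiB block size are my own choices for illustration, and I have not benchmarked this.

```perl
use strict;
use warnings;
use Digest::SHA;
use POSIX ();

# Hypothetical helper: returns one SHA-256 hex digest per $chunk_size bytes of $path.
sub pipelined_digests {
    my ($path, $chunk_size) = @_;
    my $block = 1 << 20;                       # 1 MiB per read; an arbitrary choice

    pipe(my $data_r,   my $data_w)   or die "pipe: $!";
    pipe(my $result_r, my $result_w) or die "pipe: $!";
    binmode($_) for $data_r, $data_w;

    my $pid = fork();
    die "fork: $!" unless defined $pid;

    if ($pid == 0) {                           # child: checksum stage
        close $data_w;
        close $result_r;
        my $sha  = Digest::SHA->new(256);
        my $left = $chunk_size;                # bytes still missing from this chunk
        my @digests;
        while (1) {
            my $want = $left < $block ? $left : $block;
            my $got  = read($data_r, my $buf, $want);
            POSIX::_exit(1) unless defined $got;
            last if $got == 0;                 # parent closed the pipe
            $sha->add($buf);                   # hash data as soon as it arrives
            $left -= $got;
            if ($left == 0) {
                push @digests, $sha->hexdigest;    # hexdigest also resets $sha
                $left = $chunk_size;
            }
        }
        push @digests, $sha->hexdigest if $left != $chunk_size;   # last, short chunk
        # Results are sent only after all input has been consumed,
        # so the two pipes cannot block each other.
        print {$result_w} "$_\n" for @digests;
        close $result_w;
        POSIX::_exit(0);                       # skip END blocks inherited from the parent
    }

    # parent: reader stage
    close $data_r;
    close $result_w;
    open(my $in, '<:raw', $path) or die "open $path: $!";
    while (1) {
        my $got = read($in, my $buf, $block);
        die "read $path: $!" unless defined $got;
        last if $got == 0;
        print {$data_w} $buf or die "write to pipe: $!";
    }
    close $in;
    close $data_w;                             # EOF tells the child to finish
    chomp(my @digests = <$result_r>);
    close $result_r;
    waitpid($pid, 0);
    die "checksum process failed" if $? != 0;
    return @digests;
}
```

Calling pipelined_digests('file.dat', 100_000_000) should then give one digest per 100 MB chunk. Whether this beats a plain sequential loop depends on the read/hash ratio mentioned above, so it is worth timing both.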

If the problem is indeed IO-bound, then compressing the files to, say, half the size will reduce IO time and increase calculation time (decompressing + calculating). If you can make that ratio 50/50, then you have a nice candidate for a pipeline to halve your overall time (and increase power consumption). You should consider this only if you intend to repeat these experiments in the future (see the paragraph below); otherwise you will end up with both a longer run time and a bigger electricity bill.

Lastly, if you are thinking of doing similar experiments on the same data, you could split the file in advance (unix split --bytes=100000000 file.dat) and move the pieces onto different physical disks permanently. The cost of split+move can be worth it if you intend to do these (or similar) calculations/experiments repeatedly. Suppose you move them to 3 different disks; then you can parallelise the process over 3 threads and benchmark how close your OS and hardware get you to the theoretical 2/3 savings. With this setup you save time on every future experiment, but I doubt anyone has 3 physically distinct disks on a laptop.
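The per-disk parallelism could be sketched roughly as follows, again assuming a Unix-like system and using one forked process per disk rather than threads. The helper name parallel_digests and the idea of passing one list of files per disk are my own assumptions for illustration; addfile with the 'b' mode reads each file in binary mode.

```perl
use strict;
use warnings;
use Digest::SHA;
use POSIX ();

# Hypothetical helper: takes one array ref of file paths per disk and hashes
# each list in its own process. Returns a hash of path => SHA-256 hex digest.
sub parallel_digests {
    my @file_lists = @_;
    my @workers;
    for my $files (@file_lists) {
        pipe(my $r, my $w) or die "pipe: $!";
        my $pid = fork();
        die "fork: $!" unless defined $pid;
        if ($pid == 0) {                       # child: hash this disk's files
            close $r;
            for my $file (@$files) {
                my $hex = eval { Digest::SHA->new(256)->addfile($file, 'b')->hexdigest };
                POSIX::_exit(1) unless defined $hex;
                print {$w} "$hex $file\n";
            }
            close $w;
            POSIX::_exit(0);                   # skip END blocks inherited from the parent
        }
        close $w;                              # parent keeps only the read end
        push @workers, { pid => $pid, fh => $r };
    }

    my %digest_of;
    for my $worker (@workers) {
        while (defined(my $line = readline($worker->{fh}))) {
            chomp $line;
            my ($hex, $file) = split / /, $line, 2;
            $digest_of{$file} = $hex;
        }
        close $worker->{fh};
        waitpid($worker->{pid}, 0);
        die "worker $worker->{pid} failed" if $? != 0;
    }
    return %digest_of;
}
```

With the pieces from split spread over three mount points (the paths are up to you), something like parallel_digests(\@disk1_files, \@disk2_files, \@disk3_files) would run three hashing processes at once. Timing it against a single-process run shows how much of the theoretical 2/3 your hardware actually delivers.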

bw, bliako


In reply to Re: SHA-256? What do you all think of this? by bliako
in thread SHA-256? What do you all think of this? by locked_user erichansen1836
