
Hi Monks,

I have downloaded several hundred files of about 20GB each. I need to checksum each file and compare the result to the provided value to make sure each file is correct. md5sum takes a while on files that large, so I thought I could speed things up by running it in parallel.
I added Parallel::ForkManager to my repertoire for the download itself. I went ahead and just added it blindly, curious to see whether the single output file that all the children write to would come out misformatted - and it did =D

I attempted to solve it like so:

#!/usr/bin/perl -w
#testlock.pl
use strict;
use Parallel::ForkManager;
use Fcntl qw(:flock SEEK_END);
my @timenow = localtime;
open (my $out, ">", "output_" . $timenow[1] . "_" . $timenow[0] . ".txt") || die "Could not open output: $!\n";
my $stdout = select ($out);
$| = 1;
select ($stdout);
my @files = (1 ..100);
my $fork = new Parallel::ForkManager(8);
foreach my $file (@files){
    $fork->start and next;
    my $checksum = "md5sum $file";
    flock($out, LOCK_EX) or die "Cannot lock filehandle - $!\n";
    seek($out, 0, SEEK_END) or die "Cannot seek - $!\n";
    print $out "Analysis for file $file\n\tchecksum $checksum\n";
    flock($out, LOCK_UN) or die "Cannot unlock filehandle - $!\n";
    $fork->finish;
}
$fork->wait_all_children;
close $out;

However, after running this script a hundred times, I noticed that a significant number of the output files came out with different sizes. Here is my understanding of the situation (please correct me kindly if I'm off base):
The filehandle that I open before the loop is shared across the forked processes (per perlfunc), and the seek pointer is shared along with it. A problem can occur when two processes write simultaneously before either has updated the seek pointer, so that one effectively overwrites the other at that position.
I thought that the flock call would prevent this by requiring each process to request and respect a lock before writing. I also suspected that the write might be buffered, which could have caused the issue.

I went back and tried this without sharing a filehandle:

#!/usr/bin/perl -w
#test.pl
use strict;
use local::lib;
use LWP::Simple;
use Cwd;
use Parallel::ForkManager;
my @timenow = localtime;
my @files = (1 ..100);
my $fork = new Parallel::ForkManager(8);
foreach my $file (@files){
    $fork->start and next;
    open (my $out, ">>", "output_newfh_" . $timenow[1] . "_" . $timenow[0] . ".txt") || die "Could not open output: $!\n";
    my $checksum = "md5sum $file";
    print $out "Analysis for file $file\n\tchecksum $checksum\n";
    close $out;
    $fork->finish;
}
$fork->wait_all_children;

This works as expected - the output file sizes are the same across many trials. My questions are: why didn't the first strategy work? Is something happening when a separate filehandle is opened (is there automatic blocking somewhere?) that prevents overwrites? Can I guarantee that this tactic will always be correct? And would it make more sense to try this with ithreads instead?
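One variant I have not run at scale yet, shown here only as a sketch: each child opens its own handle in append mode and takes the lock on that private handle. On Linux, a flock(2) lock belongs to the open file description, and forked children share the description they inherit, so locking an inherited handle does not exclude siblings; a handle opened after the fork gets its own description. The names append_record and run_workers are hypothetical, and plain fork is used only to keep the sketch free of non-core modules. I am assuming flock behaves on Lustre as it does on a local filesystem, which may depend on mount options.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Hypothetical helper: append one record under an exclusive lock. The handle
# is opened by the calling process itself, so the lock is not shared with
# sibling processes the way a lock on an inherited handle would be.
sub append_record {
    my ($path, $record) = @_;
    open(my $out, '>>', $path) or die "Could not open $path: $!\n";
    flock($out, LOCK_EX)       or die "Cannot lock $path: $!\n";
    print {$out} $record;
    # close flushes the buffered record and then releases the lock
    close($out) or die "Cannot close $path: $!\n";
    return;
}

# Hypothetical driver: fork $workers children; child number $w handles
# every $workers-th entry of @files, starting at index $w.
sub run_workers {
    my ($path, $workers, @files) = @_;
    my @pids;
    for my $w (0 .. $workers - 1) {
        my $pid = fork();
        die "fork failed: $!\n" unless defined $pid;
        if ($pid == 0) {
            for (my $i = $w; $i < @files; $i += $workers) {
                my $file     = $files[$i];
                my $checksum = "md5sum $file";   # placeholder, as in the scripts above
                append_record($path,
                    "Analysis for file $file\n\tchecksum $checksum\n");
            }
            exit 0;
        }
        push @pids, $pid;
    }
    waitpid($_, 0) for @pids;   # wait for every child before returning
    return;
}
```

Calling run_workers('output_locked.txt', 8, 1 .. 100) would mirror the scripts above; with Parallel::ForkManager, append_record would simply be called between $fork->start and $fork->finish.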

The order of the output is not important, but it must all be there. I'm running this on Red Hat 6 with Perl 5.10. The system has flock(2) and fork. The files are all genomic data in BAM format. The underlying filesystem is Lustre, which I'm hoping will play nice with the heavy I/O of the md5 call in this program. In the example programs above I simplified the code as much as possible, so the checksum is just a placeholder string rather than a real md5sum call.
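For context, the checksum step that the placeholder string stands in for could look roughly like the sketch below. It uses the core Digest::MD5 module instead of shelling out to md5sum; the helper names md5_of_file and checksum_ok are hypothetical, and the expected digest is assumed to arrive as a hex string.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: hex MD5 digest of a file. addfile reads the handle
# in chunks, so a 20GB file is never held in memory at once.
sub md5_of_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Could not open $path: $!\n";
    binmode($fh);   # BAM is binary data
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Hypothetical helper: true if the computed digest matches the provided
# value, compared case-insensitively.
sub checksum_ok {
    my ($path, $expected) = @_;
    return lc(md5_of_file($path)) eq lc($expected);
}

# Optional command-line use: script.pl FILE EXPECTED_MD5
if (@ARGV == 2) {
    print checksum_ok(@ARGV) ? "OK\n" : "MISMATCH\n";
}
```

Each child would call checksum_ok on its own file and write only the short result line to the shared output, so the heavy I/O stays on the input side.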

In reply to Multi-threaded behavior, file handle dups and writing to a file by cganote
