http://qs1969.pair.com?node_id=568393

bernanke01 has asked for the wisdom of the Perl Monks concerning the following question:

Hello again,

Recently I asked about tuning multiple regular expressions and got some good suggestions and help with benchmarking. That code is now working well. However, I'm working on a many-CPU machine and would like to take advantage of it by forking the different searches.

Essentially, my program (given at the bottom) does three things:

  1. open a series of files one-by-one and load into memory
  2. for each file perform many (10-30) searches
  3. write the search results to a single file

My thought was to fork each of the separate string searches, so the string gets loaded from disk only once, and multiple CPUs can be used for searching it.

Aside from wondering if this is a reasonable approach, my main question is about implementing the output to disk. Can I have all the children write to the same filehandle, or will this lead to corruption? And because the file output will almost certainly be the limiting step, am I being silly in parallelizing this in the first place? If there's a way to pass data back from the child to the parent, I could, for example, build one aggregated results hash in the parent and print it to disk in one go.
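
To make the "pass data back to the parent" idea concrete, here is a toy-sized sketch of what I have in mind. It forks one child per motif, gives each child its own pipe, and lets only the parent print. The names find_hits and search_in_parallel, the short sequence and the two motifs are made up for illustration; they are not part of my real program, and I am not claiming this is the right way to do it.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical stand-ins for one chromosome's sequence and the motif table.
my $sequence = 'ACGTACGTTTACGT';
my %motifs   = ( m1 => 'ACGT', m2 => 'TT' );

# Same non-overlapping index() loop as in the main program,
# but collecting positions instead of printing them.
sub find_hits {
    my ($seq, $str) = @_;
    my @hits;
    my $len = length($str);
    my $pos = 0;
    while ( ($pos = index($seq, $str, $pos)) >= 0 ) {
        push @hits, $pos;
        $pos += $len;
    }
    return @hits;
}

# Fork one child per motif. Each child writes its hits to its own pipe;
# the parent collects everything and is the only process that prints.
sub search_in_parallel {
    my ($name, $seq, $motifs) = @_;
    my @children;    # list of [ pid, read handle ]

    foreach my $motif ( sort keys %$motifs ) {
        pipe(my $read, my $write) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" if !defined $pid;

        if ($pid == 0) {
            # Child: search, report through the pipe, then leave.
            close($read);
            foreach my $pos ( find_hits($seq, $motifs->{$motif}) ) {
                print {$write} join("\t", $name, $pos, $motif), "\n";
            }
            close($write);      # flushes the pipe
            POSIX::_exit(0);    # skip END blocks and inherited buffers
        }

        # Parent: keep only the read end of this child's pipe.
        close($write);
        push @children, [ $pid, $read ];
    }

    my @lines;
    foreach my $child (@children) {
        my ($pid, $fh) = @$child;
        push @lines, <$fh>;     # read until the child closes its end
        close($fh);
        waitpid($pid, 0);       # reap the child
    }
    return @lines;
}

print search_in_parallel('1', $sequence, \%motifs);
```

In this sketch the parent drains the pipes one after another, so a child that produces a lot of output simply blocks until the parent gets to it. The results come out grouped by motif, one tab-separated line per hit, and nothing is ever written to the output by two processes at once.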

Any advice/suggestions much appreciated!

Here's the code:

### INCLUDES #########################################################################################
use strict;
use Bio::SeqIO;
use Carp;

### PARAMETERS #######################################################################################
my $chr_file = $ARGV[0];
my $seq_file = $ARGV[1];

if ( 2 != scalar(@ARGV) ) {
    croak 'Invalid parameter number';
}
elsif ( ! -e $chr_file || ! -T $chr_file ) {
    croak 'Missing or invalid chromosome-listing file';
}
elsif ( ! -e $seq_file || ! -T $seq_file ) {
    croak 'Missing or invalid sequence-listing file';
}

### LOCALS ###########################################################################################
my @chromosomes;
my %motifs;

### LOAD THE CHROMOSOME LIST #########################################################################
open(my $fh_chr, '<', $chr_file) or croak "Unable to open chromosome list: $chr_file";

while (<$fh_chr>) {
    # trim leading and trailing whitespace, skip empty rows
    s/^\s+//;
    s/\s+$//;
    my $row = $_;
    next() if (!$row);
    push @chromosomes, $row;
}

close($fh_chr);

### LOAD THE MOTIF LIST ##############################################################################
open(my $fh_seq, '<', $seq_file) or croak "Unable to open motif file: $seq_file";

while (<$fh_seq>) {
    # each valid row is: motif name <TAB> motif string
    s/^\s+//;
    s/\s+$//;
    my @row = split("\t");
    next() if ( 2 != scalar(@row) );
    $motifs{ $row[0] } = $row[1];
}

close($fh_seq);

### FIND SEQUENCE MOTIFS #############################################################################
foreach my $chromosome (@chromosomes) {

    my $directory = $chromosome.'/';
    my $file      = 'chr'.$chromosome.'.fa.masked';
    my $path      = $directory.$file;

    # load the whole chromosome sequence into memory once
    my $seqio = Bio::SeqIO->new(
        -file   => "<$path",
        -format => 'largefasta'
    );

    my $seq      = $seqio->next_seq();
    my $sequence = $seq->seq();

    # report every non-overlapping occurrence of each motif
    foreach my $motif ( keys(%motifs) ) {

        my $str = $motifs{$motif};
        my $len = length($str);
        my $pos = 0;

        while ( ($pos = index($sequence, $str, $pos)) >= 0 ) {
            print join("\t", $chromosome, $pos, $motif), "\n";
            $pos += $len;
        }

    }

}