Hello again,

Recently I asked about tuning multiple regular expressions and got some good suggestions and help with benchmarking. That code is now working well. However, I'm working on a many-CPU machine and would like to take advantage of it by forking the different searches.

Essentially, my program (given at the bottom) does three things:

open a series of files one-by-one and load into memory
for each file perform many (10-30) searches
write the search results to a single file

My thought was to fork each of the separate string searches, so the string gets loaded from disk only once, and multiple CPUs can be used for searching it.

Aside from wondering whether this is a reasonable approach, my main question is about implementing the output to disk. Can I have all the children write to the same filehandle, or will this lead to corruption? And because the file output will almost certainly be the limiting step, am I being silly in parallelizing this in the first place? If there's a way to pass data back from the child to the parent, I could, for example, build one aggregated results hash and print it to disk in a single pass.
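
To make the "pass data back from the child to the parent" idea concrete, here is a minimal sketch, assuming one forked child per motif and one pipe per child. The sub name search_in_children, the sample sequence and the motif names are invented for illustration and are not part of the program further down. Only the parent collects the results, so the children never share an output filehandle.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical stand-ins for the real data: a short sequence and two motifs.
my $sequence = 'ACGTACGTTTACGAAACGT';
my %motifs   = ( m1 => 'ACGT', m2 => 'TTT' );

# Fork one child per motif. Each child reports its hits through its own pipe.
sub search_in_children {
    my ( $chromosome, $sequence, $motifs ) = @_;
    my %reader_for;    # motif name => read end of that child's pipe

    foreach my $motif ( sort keys %{$motifs} ) {
        pipe( my $reader, my $writer ) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" if !defined $pid;

        if ( $pid == 0 ) {    # child: search, report, leave
            close $reader;
            close $_ for values %reader_for;    # read ends of earlier pipes
            my $str = $motifs->{$motif};
            my $pos = 0;
            while ( ( $pos = index( $sequence, $str, $pos ) ) >= 0 ) {
                print {$writer} join( "\t", $chromosome, $pos, $motif ), "\n";
                $pos += length $str;
            }
            close $writer;      # flushes the buffered hits into the pipe
            POSIX::_exit(0);    # skip END blocks and inherited STDOUT buffers
        }

        close $writer;          # parent keeps only the read end
        $reader_for{$motif} = $reader;
    }

    # Parent: drain each pipe in turn, then reap the children.
    my @results;
    foreach my $motif ( sort keys %reader_for ) {
        my $reader = $reader_for{$motif};
        while ( my $line = <$reader> ) {
            chomp $line;
            push @results, $line;
        }
        close $reader;
    }
    1 while wait() != -1;

    return @results;
}

print "$_\n" for search_in_children( '1', $sequence, \%motifs );
```

In this sketch the sequence is loaded once before the fork, so each child sees it without rereading the file. Whether the forks pay off on real chromosomes depends on how the search time compares with the cost of forking and of piping the hits back, which would need benchmarking.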

Any advice/suggestions much appreciated!

Here's the code:

### INCLUDES #########################################################################################
use strict;
use warnings;
use Bio::SeqIO;
use Carp;

### PARAMETERS #######################################################################################
my $chr_file = $ARGV[0];
my $seq_file = $ARGV[1];

if ( 2 != scalar(@ARGV) ) {
        croak 'Invalid parameter number';
        }
elsif ( ! -e $chr_file || ! -T $chr_file ) {
        croak 'Missing or invalid chromosome-listing file';
        }
elsif ( ! -e $seq_file || ! -T $seq_file ) {
        croak 'Missing or invalid sequence-listing file';
        }

### LOCALS ###########################################################################################
my @chromosomes;
my %motifs;

### LOAD THE CHROMOSOME LIST #########################################################################
open(my $fh_chr, '<', $chr_file) or croak "Unable to open chromosome list: $chr_file";

while (<$fh_chr>) {

        s/^\s+//;
        s/\s+$//;

        my $row = $_;
        next() if (!$row);

        push @chromosomes, $row;

        }

close($fh_chr);

### LOAD THE MOTIF LIST ##############################################################################
open(my $fh_seq, '<', $seq_file) or croak "Unable to open motif file: $seq_file";

while (<$fh_seq>) {

        s/^\s+//;
        s/\s+$//;

        my @row = split("\t");

        next() if ( 2 != scalar(@row) );

        $motifs{ $row[0] } = $row[1];

        }

close($fh_seq);

### FIND SEQUENCE MOTIFS #############################################################################
foreach my $chromosome (@chromosomes) {

        my $directory = $chromosome.'/';
        my $file = 'chr'.$chromosome.'.fa.masked';
        my $path = $directory.$file;

        my $seqio = Bio::SeqIO->new(
                -file    =>  "<$path",
                -format  =>  'largefasta'
                );

        my $seq = $seqio->next_seq();
        my $sequence = $seq->seq();

        foreach my $motif ( keys(%motifs) ) {

                my $str = $motifs{$motif};
                my $len = length($str);
                my $pos = 0;

                while ( ($pos = index($sequence, $str, $pos)) >= 0 ) {
                        print join("\t", $chromosome, $pos, $motif), "\n";
                        $pos += $len;
                        }

                }

        }

In reply to Forking Multiple Regex's on a Single String by bernanke01