Re: Script far too slow with large files

Hello biologistatsea,

The following is a parallel demonstration for your use case, with comments inline in the code. It addresses the slow processing in the 2nd loop.

#!/usr/bin/perl

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

my $usage  = "perl mgf_rt.pl [input mgf file] [mascot csv output] [output file]\n";
my $mgfin  = shift or die $usage;
my $csvin  = shift or die $usage;
my $output = shift or die $usage;

my (%IDS, $CSV, $OUT);

# Instantiate a shared HASH for use by MCE workers.
my $IDrtPairs = MCE::Shared->hash();

# Obtain IDs from the CSV file.
open $CSV, '<', $csvin or die "open error ($csvin): $!\n";

while ( <$CSV> ) {
    # (?:...) groups without capturing, so the only capture group
    # is the 2nd ( ) and $1 refers to it.
    $IDS{ $1 } = 1 if /^(?:[^,]*,){30}"([^"]*)/;
}

close $CSV;

print "Finished collecting spectra names from CSV file\n";

# Process the huge file by record separator. The "\nTITLE" RS is a
# special case which anchors "TITLE" at the start of the line.
# Workers receive records beginning with "TITLE" and ending in
# "\n". A chunk_size greater than 8192 is taken as a number of
# bytes to read rather than a number of records. A worker finishes
# reading the rest of its last record before the next worker reads
# the next chunk.

mce_flow_f {
    max_workers => 4,
    chunk_size  => "2m",
    RS          => "\nTITLE",
},
sub {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;
    my %pairs;

    # Collect pairs locally.
    for my $rec ( @{ $chunk_ref } ) {
        my @match = $rec =~ /^(?:TITLE|RTINSECONDS)=([^\n]+)/mg;
        $pairs{ $match[0] } = $match[1] if ( @match == 2 );
    }

    # Send pairs to the shared manager process.
    $IDrtPairs->mset(%pairs) if %pairs;

}, $mgfin;

# Shutdown the MCE workers.
MCE::Flow::finish;

print "Finished getting RT information from MGF file\n";

# Export the shared HASH into a local copy and destroy the shared
# instance, so that the next step does not involve the shared
# manager process.
$IDrtPairs = $IDrtPairs->destroy;

# Output to a new CSV file.
open $CSV, '<', $csvin  or die "open error ($csvin): $!\n";
open $OUT, '>', $output or die "open error ($output): $!\n";

while ( my $line = <$CSV> ) {
    chomp $line;
    if ( $line =~ /^(?:[^,]*,){30}"([^"]*)/ ) {
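        # Use an empty field when the MGF file had no record for this
        # ID; this avoids an "uninitialized value" warning.
        my $reten = $IDrtPairs->get($1) // '';
        print $OUT "$line,$reten\n";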
    }
    else {
        print $OUT "$line\n";
    }
}

close $CSV;
close $OUT;
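
To see what the two regular expressions in the script pick up, here is a minimal sketch that runs them on made-up data without MCE. The sample record and CSV row are hypothetical and not taken from your files; only the two patterns come from the script above.

```perl
use strict;
use warnings;

# Hypothetical MGF record, shaped like what a worker receives.
my $rec = join "\n",
    'TITLE=sample.1234.1234.2',
    'PEPMASS=445.12',
    'RTINSECONDS=1532.7',
    '100.1 2000',
    'END IONS', '';

# In list context, /mg returns the capture from every matching line.
my @match = $rec =~ /^(?:TITLE|RTINSECONDS)=([^\n]+)/mg;
print "title=$match[0] rt=$match[1]\n" if @match == 2;

# Hypothetical CSV row: 30 unquoted fields, then the quoted title.
my $row = join( ',', ('x') x 30 ) . ',"sample.1234.1234.2",more';

# (?:[^,]*,){30} skips 30 comma-terminated fields without capturing;
# the capture is the text inside the quotes of field 31.
my ($id) = $row =~ /^(?:[^,]*,){30}"([^"]*)/;
print "id=$id\n";
```

Two things to keep in mind: the first pattern relies on TITLE appearing before RTINSECONDS within each record, and the second assumes none of the first 30 CSV fields contains an embedded comma.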

Perl is fun, Mario.
