
Hello biologistatsea,

Here is a parallel demonstration for your use case, using MCE. The comments are inline in the code. It addresses the concern about slow processing in the second loop.

#!/usr/bin/perl

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

my $usage  = "perl mgf_rt.pl [input mgf file] [mascot csv output] [output file]\n";
my $mgfin  = shift or die $usage;
my $csvin  = shift or die $usage;
my $output = shift or die $usage;

my (%IDS, $CSV, $OUT);

# Instantiate a shared HASH for use by MCE workers.
my $IDrtPairs = MCE::Shared->hash();

# Obtain IDs from the CSV file.
open $CSV, '<', $csvin or die "open error ($csvin): $!\n";

while ( <$CSV> ) {
    # ?: means not to capture the 1st ( )
    # therefore, $1 refers to the 2nd ( )
    $IDS{ $1 } = 1 if /^(?:[^,]*,){30}"([^"]*)/;
}

close $CSV;

print "Finished collecting spectra names from CSV file\n";

# Process the huge file by record separator. The "\nTITLE" RS is a
# special case which anchors "TITLE" at the start of the line.
# Workers receive records beginning with "TITLE" and ending in
# "\n". A chunk_size greater than 8192 is treated as a number of
# bytes to read rather than a number of records. A worker finishes
# reading the rest of its last record before the next worker reads
# the next chunk.

mce_flow_f {
    max_workers => 4,
    chunk_size  => "2m",
    RS          => "\nTITLE",
},
sub {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;
    my %pairs;

    # Collect pairs locally, keeping only titles seen in the CSV file.
    for my $rec ( @{ $chunk_ref } ) {
        my @match = $rec =~ /^(?:TITLE|RTINSECONDS)=([^\n]+)/mg;
        $pairs{ $match[0] } = $match[1]
            if ( @match == 2 && exists $IDS{ $match[0] } );
    }

    # Send pairs to the shared manager process.
    $IDrtPairs->mset(%pairs) if %pairs;

}, $mgfin;

# Shutdown the MCE workers.
MCE::Flow::finish;

print "Finished getting RT information from MGF file\n";

# Export/destroy the shared HASH into a local copy. Basically,
# to not involve the shared manager process for the next step.
$IDrtPairs = $IDrtPairs->destroy;

# Output to a new CSV file.
open $CSV, '<', $csvin  or die "open error ($csvin): $!\n";
open $OUT, '>', $output or die "open error ($output): $!\n";

while ( my $line = <$CSV> ) {
    chomp $line;
    if ( $line =~ /^(?:[^,]*,){30}"([^"]*)/ ) {
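        # Fall back to an empty field when the MGF file has no RT
        # for this title (avoids an "uninitialized value" warning).
        my $reten = $IDrtPairs->get($1) // '';
        print $OUT "$line,$reten\n";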
    }
    else {
        print $OUT "$line\n";
    }
}

close $CSV;
close $OUT;
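
To see what the two regular expressions in the script capture, here is a small stand-alone sketch. The CSV row and the MGF record are made-up sample data, not taken from your files, and the row assumes that none of the first 30 columns contains a quoted comma (the regex counts commas and would miscount in that case).

```perl
use strict;
use warnings;

# Hypothetical CSV row: 30 plain fields, then a quoted spectrum title.
my $row = join( ',', 1 .. 30 ) . ',"spectrum_42",more';

# (?:[^,]*,){30} skips 30 comma-terminated fields without capturing,
# so $1 (returned here in list context) is the text inside the quotes.
my ($title) = $row =~ /^(?:[^,]*,){30}"([^"]*)/;
print "CSV title: $title\n";

# Hypothetical MGF record, shaped like what a worker receives.
my $rec = "TITLE=spectrum_42\nRTINSECONDS=1234.5\nPEPMASS=500.25\n"
        . "100.1 2000\nEND IONS\n";

# With /m, ^ matches at the start of every line. With /g in list
# context, each match contributes its captured value to @match.
my @match = $rec =~ /^(?:TITLE|RTINSECONDS)=([^\n]+)/mg;
print "MGF pair: $match[0] => $match[1]\n" if @match == 2;
```

The `@match == 2` check in the script relies on this behavior: a record that lacks either line yields fewer than two values and is skipped.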

Perl is fun, Mario.

In reply to Re: Script far too slow with large files - wisdom needed! by marioroy
in thread Script far too slow with large files - wisdom needed! by biologistatsea
