I've written a script that takes IDs from a CSV file, uses them to extract certain information from a very large file (76 million lines, the output of a mass spectrometer), and then writes a new file which essentially copies the original CSV and appends the new information to each record. It works perfectly on small files, but takes an age (more than 30 minutes; I haven't let it run longer) on files of a realistic size. I'm a Perl novice, so I'm sure my code is somehow hugely inefficient. Can anyone see any obvious ways in which I could speed this up? Thanks!!

#!/usr/bin/perl

use strict;
use warnings;


my $usage = "perl mgf_rt.pl [input mgf file] [mascot csv output] [output file]\n";
my $mgfin = shift or die $usage;
my $csvin = shift or die $usage;
my $output = shift or die $usage;
my @IDS;

open (CSV, '<', $csvin) or die "Cannot open $csvin: $!";

while (my $line = <CSV>){
    chomp $line;
    my $id = 'initial';
    if($line =~ /([^,]*,){30}"([^"]*)/){$id = $2;
    push (@IDS, $id);
    }
}

close CSV;

print "Finished collecting spectra names from CSV file\n";

open (MGF, '<', $mgfin) or die "Cannot open $mgfin: $!";

my $A = '0';
my $B = '0';
my $TI = 'initial';
my $RT = 'initial';
my %IDrtPairs;

while (my $line = <MGF>){
    if ($line =~ /TITLE=(.*)/){$TI="$1"; 
                   $A = '1';
    }
    if ($line =~ /RTINSECONDS=(.*)/){$RT="$1";
                     $B = '1';
    }
    if ($A+$B == 2 && grep {$_ eq $TI} @IDS){
    $IDrtPairs{"$TI"}="$RT";
    $A = '0';
    $B = '0';
    }
}

close MGF;

print "Finished getting RT information from MGF file\n";

open (CSV, '<', $csvin) or die "Cannot open $csvin: $!";

open (OUT, '>>', $output) or die "Cannot open $output: $!";

while (my $line = <CSV>){
    chomp $line;
    my $id = 'initial';
    if($line =~ /([^,]*,){30}"([^"]*)/){$id = $2;
                    my $reten = "$IDrtPairs{$id}";
                    print OUT  "$line,$reten\n";
    }else{
    print OUT "$line\n";
    }
}   

close CSV;
close OUT;

Naturally, it's the second of the three while loops that takes ages to complete, since it reads the large file line by line. An example of the information in the very large file (filehandle MGF in the code above) is as follows:

SEARCH=MIS
MASS=Monoisotopic
BEGIN IONS
TITLE=AE.36154.36154.2 (intensity=3533482168.8807)
PEPMASS=358.209301553256
CHARGE=2+
SCANS=36154
RTINSECONDS=1697.984
55.05507 86438.71
56.05026 89053.36
60.0452 843930.94
60.05638 100834.36
69.07059 82593.55
70.02967 63427.3
70.0659 1222576
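Reading 76 million lines one at a time is probably not the whole story: the second loop also runs `grep` over the full `@IDS` array each time a title and retention time have been seen, so the cost grows with the number of IDs times the number of spectra. A minimal sketch of one possible alternative follows, in which the IDs are kept as hash keys so that each check is a single lookup. The helper name `collect_rt` and the `%wanted` hash are hypothetical, and the sketch assumes that TITLE comes before RTINSECONDS within each record, as it does in the sample above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): scan an MGF
# filehandle once and return a hash ref of TITLE => RTINSECONDS for the
# titles that appear as keys in %$wanted.
# Assumption: TITLE precedes RTINSECONDS within each record.
sub collect_rt {
    my ($fh, $wanted) = @_;
    my (%rt_for, $title);
    while (my $line = <$fh>) {
        if ($line =~ /^TITLE=(.*)/) {
            $title = $1;
        }
        elsif (defined $title && $line =~ /^RTINSECONDS=(.*)/) {
            # A hash lookup replaces the grep over the whole ID array.
            $rt_for{$title} = $1 if exists $wanted->{$title};
            undef $title;
        }
    }
    return \%rt_for;
}

# Small in-memory demonstration using the sample record from the post.
my $sample = <<'END';
BEGIN IONS
TITLE=AE.36154.36154.2 (intensity=3533482168.8807)
PEPMASS=358.209301553256
RTINSECONDS=1697.984
55.05507 86438.71
BEGIN IONS
TITLE=not.in.the.csv
RTINSECONDS=1.0
END

# In the real script the keys would be the IDs collected from the CSV.
my %wanted = map { $_ => 1 } ('AE.36154.36154.2 (intensity=3533482168.8807)');

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
my $rt_for = collect_rt($fh, \%wanted);
close $fh;

print "$_ => $rt_for->{$_}\n" for sort keys %$rt_for;
```

In the full script, `%wanted` would be filled in the first loop (for example `$wanted{$id} = 1` in place of the `push`), and the filehandle passed in would be the one opened on the MGF file.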

In reply to Script far too slow with large files - wisdom needed! by biologistatsea