biologistatsea has asked for the wisdom of the Perl Monks concerning the following question:
I've written a script that takes IDs from a CSV file, uses them to extract certain information from a very large file (76 million lines - the output of a mass spectrometer), and then writes a new file which essentially copies the original CSV and appends the new information to each record. It works perfectly on small files, but takes an age (>30 mins - I haven't let it run longer) on files of a realistic size. I'm a Perl novice, so I'm sure my code is somehow hugely inefficient - can anyone see any obvious ways in which I could speed this up? Thanks!!
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $usage = "perl mgf_rt.pl [input mgf file] [mascot csv output] [output file]\n";
    my $mgfin = shift or die $usage;
    my $csvin = shift or die $usage;
    my $output = shift or die $usage;

    my @IDS;
    open (CSV, '<', $csvin);
    while (my $line = <CSV>){
        chomp $line;
        my $id = 'initial';
        if($line =~ /([^,]*,){30}"([^"]*)/){
            $id = $2;
            push (@IDS, $id);
        }
    }
    close CSV;
    print "Finished collecting spectra names from CSV file\n";

    open (MGF, '<', $mgfin);
    my $A = '0';
    my $B = '0';
    my $TI = 'initial';
    my $RT = 'initial';
    my %IDrtPairs;
    while (my $line = <MGF>){
        if ($line =~ /TITLE=(.*)/){
            $TI="$1";
            $A = '1';
        }
        if ($line =~ /RTINSECONDS=(.*)/){
            $RT="$1";
            $B = '1';
        }
        if ($A+$B == 2 && grep {$_ eq $TI} @IDS){
            $IDrtPairs{"$TI"}="$RT";
            $A = '0';
            $B = '0';
        }
    }
    close MGF;
    print "Finished getting RT information from MGF file\n";

    open (CSV, '<', $csvin);
    open (OUT, '>>', $output);
    while (my $line = <CSV>){
        chomp $line;
        my $id = 'initial';
        if($line =~ /([^,]*,){30}"([^"]*)/){
            $id = $2;
            my $reten = "$IDrtPairs{$id}";
            print OUT "$line,$reten\n";
        }else{
            print OUT "$line\n";
        }
    }
    close CSV;
    close OUT;
Naturally, it's the second of the three while loops that takes ages to complete, since it reads the large file line by line. An example of the information in the very large file (filehandle MGF in the code above) is as follows:
    SEARCH=MIS
    MASS=Monoisotopic
    BEGIN IONS
    TITLE=AE.36154.36154.2 (intensity=3533482168.8807)
    PEPMASS=358.209301553256
    CHARGE=2+
    SCANS=36154
    RTINSECONDS=1697.984
    55.05507 86438.71
    56.05026 89053.36
    60.0452 843930.94
    60.05638 100834.36
    69.07059 82593.55
    70.02967 63427.3
    70.0659 1222576
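One likely source of the slowdown is the `grep {$_ eq $TI} @IDS` test in the second loop: it walks the whole `@IDS` array each time it runs, and because the `$A` and `$B` flags are only reset when a match is found, it can end up running on every remaining line of the MGF file. A minimal sketch of one alternative follows, keeping the wanted IDs as hash keys so that each check is a single `exists` lookup. The helper name `collect_rt`, the `%wanted` hash and the in-memory sample are illustrative and not part of the original script, and the sketch assumes that `TITLE` comes before `RTINSECONDS` within each `BEGIN IONS` record, as in the sample above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): scan an MGF stream once
# and return a hash ref of TITLE => RTINSECONDS for the titles in %$wanted.
# Assumes TITLE precedes RTINSECONDS inside each BEGIN IONS record.
sub collect_rt {
    my ($fh, $wanted) = @_;
    my (%rt_for, $title);
    while (my $line = <$fh>) {
        if ($line =~ /^BEGIN IONS/) {
            undef $title;                  # new record: forget the previous title
        }
        elsif ($line =~ /^TITLE=(.*)/) {
            $title = $1;
        }
        elsif ($line =~ /^RTINSECONDS=(.*)/) {
            # exists() on a hash is one lookup, not a scan over every ID
            $rt_for{$title} = $1
                if defined $title && exists $wanted->{$title};
        }
    }
    return \%rt_for;
}

# Illustrative data only: one wanted ID and a cut-down copy of the sample record.
my %wanted = map { $_ => 1 } ('AE.36154.36154.2 (intensity=3533482168.8807)');

my $sample = <<'MGF';
SEARCH=MIS
MASS=Monoisotopic
BEGIN IONS
TITLE=AE.36154.36154.2 (intensity=3533482168.8807)
PEPMASS=358.209301553256
CHARGE=2+
SCANS=36154
RTINSECONDS=1697.984
55.05507 86438.71
56.05026 89053.36
MGF

# Read from an in-memory filehandle so the sketch runs without a real MGF file.
open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
my $rt_for = collect_rt($fh, \%wanted);
close $fh;

print "$_ => $rt_for->{$_}\n" for sort keys %$rt_for;
```

In the real script, `%wanted` would be filled in the first loop (for example `$wanted{$2} = 1` in place of the `push`), and `collect_rt` would be given a filehandle opened on `$mgfin`. Whether this removes the whole bottleneck depends on how many IDs the CSV holds, so it is worth timing on a realistic file.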