biologistatsea has asked for the wisdom of the Perl Monks concerning the following question:

I've written a script that takes IDs from a csv file, uses those to extract certain information from a very large file (76 million lines - the output of a mass spectrometer) and then writes a new file which essentially copies the original csv and appends the new information to each record. It works perfectly on small files, but takes an age (>30 mins - I haven't tried longer) on files of a realistic size. I'm a perl novice, so I'm sure my code is somehow hugely inefficient - can anyone see any obvious ways in which I could speed this up? Thanks!!

#!/usr/bin/perl
use strict;
use warnings;

my $usage = "perl mgf_rt.pl [input mgf file] [mascot csv output] [output file]\n";
my $mgfin  = shift or die $usage;
my $csvin  = shift or die $usage;
my $output = shift or die $usage;

my @IDS;
open (CSV, '<', $csvin);
while (my $line = <CSV>){
    chomp $line;
    my $id = 'initial';
    if($line =~ /([^,]*,){30}"([^"]*)/){
        $id = $2;
        push (@IDS, $id);
    }
}
close CSV;
print "Finished collecting spectra names from CSV file\n";

open (MGF, '<', $mgfin);
my $A = '0';
my $B = '0';
my $TI = 'initial';
my $RT = 'initial';
my %IDrtPairs;
while (my $line = <MGF>){
    if ($line =~ /TITLE=(.*)/){$TI = "$1"; $A = '1'; }
    if ($line =~ /RTINSECONDS=(.*)/){$RT = "$1"; $B = '1'; }
    if ($A+$B == 2 && grep {$_ eq $TI} @IDS){
        $IDrtPairs{"$TI"} = "$RT";
        $A = '0';
        $B = '0';
    }
}
close MGF;
print "Finished getting RT information from MGF file\n";

open (CSV, '<', $csvin);
open (OUT, '>>', $output);
while (my $line = <CSV>){
    chomp $line;
    my $id = 'initial';
    if($line =~ /([^,]*,){30}"([^"]*)/){
        $id = $2;
        my $reten = "$IDrtPairs{$id}";
        print OUT "$line,$reten\n";
    }else{
        print OUT "$line\n";
    }
}
close CSV;
close OUT;

Naturally, it's the second of the three while loops which takes ages to complete, since it reads the large file line by line. An example of the information in the very large file (filehandle MGF in the code above) is as follows:

SEARCH=MIS
MASS=Monoisotopic
BEGIN IONS
TITLE=AE.36154.36154.2 (intensity=3533482168.8807)
PEPMASS=358.209301553256
CHARGE=2+
SCANS=36154
RTINSECONDS=1697.984
55.05507 86438.71
56.05026 89053.36
60.0452 843930.94
60.05638 100834.36
69.07059 82593.55
70.02967 63427.3
70.0659 1222576

Re: Script far too slow with large files - wisdom needed!
by toolic (Bishop) on Jan 21, 2016 at 16:29 UTC
    I doubt it will speed things up much, but looking up a key in a hash is generally quicker than searching through an array. Change @IDS to %IDS, then you don't need the grep.
    $IDS{$id}++;
    ...
    if ($A+$B == 2 && (exists $IDS{$TI})){
      Turning the second if into an elsif might also save some microseconds.
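      For example, a minimal sketch of the reshuffled loop body (untested; it uses the variable names from the original post plus the %IDS hash suggested above):

      if    ($line =~ /TITLE=(.*)/)       { $TI = $1; $A = 1; }
      elsif ($line =~ /RTINSECONDS=(.*)/) { $RT = $1; $B = 1; }

      if ($A + $B == 2 && exists $IDS{$TI}) {
          $IDrtPairs{$TI} = $RT;
          $A = 0;
          $B = 0;
      }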
      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: Script far too slow with large files - wisdom needed!
by Eily (Monsignor) on Jan 21, 2016 at 17:32 UTC

    Your MGF file seems to be divided into records (blocks). If the first lines of a record always contain the same information in the same order, you can skip the useless lines. For that, you should read the file record by record instead of line by line (which would also be safer than your method for finding pairs: if for any reason the TITLE line doesn't match for one record, you'll pair the RT of that record with the TITLE of the next, and so on until the next error). This can be done using the input record separator, $/. If your records are paragraphs (separated by at least one empty line, but containing no empty lines themselves) you can do something like:

    use constant { TITLE => 3, RT => 7 };

    {   # extra block so that $/ retrieves its previous value after reading this file
        local $/ = "";          # Paragraph mode
        while (my $record = <MGF>) {
            my @lines = split "\n", $record;
            $TI = "$1" if $lines[TITLE] =~ /^TITLE=(.*)/;
            $RT = "$1" if $lines[RT]    =~ /^RTINSECONDS=(.*)/;
        }
    }

    If the records are not paragraphs, this could probably work by setting $/ = "SEARCH" if all records start with it.
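    A rough, untested sketch of that variant, scanning each whole record with multi-line regexes rather than fixed line positions (and using the %IDS hash suggested above for the lookup):

    {
        local $/ = "SEARCH";                      # each read ends at the next "SEARCH"
        while (my $record = <MGF>) {
            my ($ti) = $record =~ /^TITLE=(.*)/m;
            my ($rt) = $record =~ /^RTINSECONDS=(.*)/m;
            $IDrtPairs{$ti} = $rt
                if defined $ti and defined $rt and exists $IDS{$ti};
        }
    }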

    If some of the fields in the records may be missing, and you can't use their position, reading record by record can still help you skip the lines at the end once you have found what you are looking for:

    { local $/ = ""; # Paragraph mode RECORD: while (my $record = <MGF>) { my @lines = split "\n", $record ; $TI = undef; $RT = undef; for my $line (@lines) { $TI = "$1" if $line =~ /^TITLE=(.*)/; $RT = "$1" if $line =~ /^RTINSECONDS=(.*)/; next RECORD if defined $TI and defined $RT; # skip to next recor +d } } }

    In any case, /^TITLE=(.*)/ is a little faster than /TITLE=(.*)/ because it tells perl to stop searching if TITLE isn't at the beginning of the line.

    Edit : added missing end of code sections

Re: Script far too slow with large files - wisdom needed!
by BrowserUk (Patriarch) on Jan 21, 2016 at 17:33 UTC

    The simplest way to speed up your second loop would be to skip over all the non-title lines:

    my( $ti, $rt );
    {
        { local $/ = "TITLE=";       <MGF>; }
        ( $ti ) = <MGF> =~ m[^(\S+)];

        { local $/ = "RTINSECONDS="; <MGF>  }
        ( $rt ) = <MGF> =~ m[^(\S+)];

        if( exists $IDS{ $ti } ) {
            $IDrtPairs{ $ti } = $rt;
        }

        redo unless eof( MGF );
    }

    Warning: untested code that probably needs a couple of tweaks, but it should explain the idea.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Script far too slow with large files - wisdom needed!
by GrandFather (Saint) on Jan 21, 2016 at 20:14 UTC

    Using your "realistic time" data set profile your code (see Devel::Profile) and see where energy needs to be focused to improve performance. You can then measure performance improvements and change focus as appropriate.

    Yet another guess at a performance improvement is to simply read the entire file into memory. With multiple gigabytes of memory in most typical computers, that really shouldn't be an issue. You can then run a regex over the whole thing to edit lines in place, then print the result in one hit.
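    A rough, untested sketch of that idea applied to the slow MGF pass (it assumes the whole file fits comfortably in RAM, and it reuses the %IDS hash suggested in the other replies):

    open my $mgf, '<', $mgfin or die "open error ($mgfin): $!\n";
    my $whole = do { local $/; <$mgf> };    # slurp the entire MGF file into one string
    close $mgf;

    # one global pass: pick up every TITLE line and the RTINSECONDS line that follows it
    while ($whole =~ /^TITLE=([^\n]+)\n(?:(?!TITLE=)[^\n]*\n)*?RTINSECONDS=([^\n]+)/mg) {
        $IDrtPairs{$1} = $2 if exists $IDS{$1};
    }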

    Premature optimization is the root of all job security
Re: Script far too slow with large files - wisdom needed!
by RichardK (Parson) on Jan 21, 2016 at 17:53 UTC

    It might improve things a bit if you rearrange your if statements. You don't need to check ($A + $B == 2 ...) for each line of the input file, as you know that $B can only change when you've matched RTINSECONDS, so you could do this instead:

    if ($line =~ /TITLE=(.*)/) {
        $title = $1;
    }
    elsif ($line =~ /RTINSECONDS=(.*)/) {
        $RT = $1;
        if ($title && grep {$_ eq $title} @IDS) {
            $IDrtPairs{$title} = $RT;
            $title = undef;
        }
    }

    As already mentioned it will be better to use a hash, so you should do that too :)

    The example input data looks like TITLE & RTINSECONDS are always at the beginning of the line, so you could anchor your regexes too; then non-matching lines are rejected quicker, i.e. /^TITLE=(.*)/.

Re: Script far too slow with large files - wisdom needed!
by hdb (Monsignor) on Jan 21, 2016 at 17:40 UTC

    If this is not a one-off exercise, use a database system.
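    As an untested sketch of what that might look like with DBI and DBD::SQLite (the database file, table and column names here are made up for illustration): scan the MGF file once, store the TITLE/RTINSECONDS pairs, and query them in later runs instead of rescanning.

    use DBI;

    # connect to (or create) a local SQLite file
    my $dbh = DBI->connect("dbi:SQLite:dbname=spectra.db", "", "",
                           { RaiseError => 1, AutoCommit => 0 });

    $dbh->do("CREATE TABLE IF NOT EXISTS retention (title TEXT PRIMARY KEY, rt REAL)");

    # load the pairs collected from the MGF file
    my $ins = $dbh->prepare("INSERT OR REPLACE INTO retention (title, rt) VALUES (?, ?)");
    while (my ($ti, $rt) = each %IDrtPairs) {
        $ins->execute($ti, $rt);
    }
    $dbh->commit;

    # later runs: look up an ID directly instead of rescanning 76 million lines
    my $some_id = 'AE.36154.36154.2 (intensity=3533482168.8807)';
    my ($rt) = $dbh->selectrow_array(
        "SELECT rt FROM retention WHERE title = ?", undef, $some_id);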

Re: Script far too slow with large files - wisdom needed!
by flexvault (Monsignor) on Jan 22, 2016 at 08:52 UTC

    Hello biologistatsea,

    You've gotten some good pointers, but no-one has mentioned using 'index' instead of a regex. 'index' and 'substr' are extremely fast if used properly. (Untested example).

    while (my $line = <MGF>) {

        my $si = index( $line, 'TITLE=' );              ## Find start of 'TITLE=' string from beginning
        if ( $si >= 0 ) {
            my $sj = index( $line, "\n", $si + 5 );     ## Get data after 'TITLE='
            if ( $sj > $si ) { $TI = substr( $line, $si+6, $sj-($si+6) ); }
        }
        .
        .
        .
    }

    Remember 'index' returns '-1' if the string is not found, so you have to test for a non-negative number. (You can't just do 'if ( $si )', since '-1' is not 0 and is therefore TRUE.)
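    For example (plain Perl, just to show the pitfall with a TITLE line from the sample data):

    my $line = "TITLE=AE.36154.36154.2 (intensity=3533482168.8807)";
    my $si   = index($line, 'TITLE=');    # 0 here: found at the very start of the line

    print "found\n" if $si >= 0;          # correct test, prints "found"
    # 'if ($si)' would be wrong: false for 0 (found at position 0), true for -1 (not found)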

    I've seen 10 to 50 times improvement on many-GByte files. YMMV.

    Regards...Ed

    "Well done is better than well said." - Benjamin Franklin

      I've gotten good results from this too. In cases where you have a big regex which is too complicated to break down, you can do both: first, use index to do a simpler test. If that passes, then go on to use the full regex.
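      A small untested sketch of that combination for this thread's MGF lines; the cheap index tests throw away the bulk of the 76 million lines before any regex runs:

      while (my $line = <MGF>) {
          # cheap prefilter: skip lines that cannot match either pattern
          next unless index($line, 'TITLE=') == 0
                   or index($line, 'RTINSECONDS=') == 0;

          if    ($line =~ /^TITLE=(.*)/)       { $TI = $1; }
          elsif ($line =~ /^RTINSECONDS=(.*)/) { $RT = $1; }
      }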
Re: Script far too slow with large files - wisdom needed!
by GotToBTru (Prior) on Jan 21, 2016 at 16:36 UTC

    You can gain some time by using a search for the id that stops when it finds a match instead of continuing through the whole list.

    use List::Util qw/any/;

    ...

    # if ($A+$B == 2 && grep {$_ eq $TI} @IDS){
    if ($A && $B && any { $_ eq $TI } @IDS) {

    Update: toolic's idea is better.

    But God demonstrates His own love toward us, in that while we were yet sinners, Christ died for us. Romans 5:8 (NASB)

Re: Script far too slow with large files - wisdom needed!
by QuillMeantTen (Friar) on Jan 22, 2016 at 08:25 UTC

    Greetings,
    I second what GrandFather said about Devel::Profile. You might want to have a look at FFI::Platypus. It should allow you to rewrite the routine pointed at by the profiler in a more optimised way (namely as a C library, or even in assembler I guess).
    Keep in mind that I'd first try to find more patterns in the data (as proposed by other posters), to get as much leverage as possible before doing a rewrite at a lower level.
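    As a taste of the mechanism (essentially the hello-world from the FFI::Platypus synopsis, not tested here): a C function is attached so it can be called like a Perl sub; the hot MGF-scanning routine would be written as a C library function and attached the same way.

    use FFI::Platypus 1.00;

    my $ffi = FFI::Platypus->new( api => 1 );
    $ffi->lib(undef);                              # undef = search the libraries already loaded (libc)
    $ffi->attach( puts => ['string'] => 'int' );   # expose C's puts() as a Perl sub

    puts("called a C function from Perl");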

Re: Script far too slow with large files - wisdom needed!
by biologistatsea (Acolyte) on Jan 22, 2016 at 11:09 UTC
    Many thanks to everyone for so many ideas! Lots to think about and learn here, and hopefully I'll get it to run more efficiently.
Re: Script far too slow with large files - wisdom needed!
by marioroy (Prior) on Apr 27, 2016 at 04:27 UTC

    Hello biologistatsea,

    The following is a parallel demonstration for your use case. The comments are inlined inside the code. This helps address the concern with slow processing for the 2nd loop.

    #!/usr/bin/perl
    use strict;
    use warnings;

    use MCE::Flow;
    use MCE::Shared;

    my $usage = "perl mgf_rt.pl [input mgf file] [mascot csv output] [output file]\n";
    my $mgfin  = shift or die $usage;
    my $csvin  = shift or die $usage;
    my $output = shift or die $usage;

    my (%IDS, $CSV, $OUT);

    # Instantiate a shared HASH for use by MCE workers.
    my $IDrtPairs = MCE::Shared->hash();

    # Obtain IDs from the CSV file.
    open $CSV, '<', $csvin or die "open error ($csvin): $!\n";
    while ( <$CSV> ) {
        # ?: means not to capture the 1st ( )
        # therefore, $1 refers to the 2nd ( )
        $IDS{ $1 } = 1 if /^(?:[^,]*,){30}"([^"]*)/;
    }
    close $CSV;

    print "Finished collecting spectra names from CSV file\n";

    # Process huge file by record separator. The "\nTITLE" RS is a
    # special case which anchors "TITLE" at the start of the line.
    # Workers receive records beginning with "TITLE" and ending in
    # "\n". A chunk_size greater than 8192 means to read # of bytes.
    # A worker completes reading the rest of the record before the
    # next worker reads the next chunk.

    mce_flow_f {
        max_workers => 4, chunk_size => "2m", RS => "\nTITLE",
    },
    sub {
        my ( $mce, $chunk_ref, $chunk_id ) = @_;
        my %pairs;   # Collect pairs locally.

        for my $rec ( @{ $chunk_ref } ) {
            my @match = $rec =~ /^(?:TITLE|RTINSECONDS)=([^\n]+)/mg;
            $pairs{ $match[0] } = $match[1] if ( @match == 2 );
        }

        # Send pairs to the shared manager process.
        $IDrtPairs->mset(%pairs) if %pairs;

    }, $mgfin;

    # Shutdown the MCE workers.
    MCE::Flow::finish;

    print "Finished getting RT information from MGF file\n";

    # Export/destroy the shared HASH into a local copy. Basically,
    # to not involve the shared manager process for the next step.
    $IDrtPairs = $IDrtPairs->destroy;

    # Output to a new CSV file.
    open $CSV, '<', $csvin  or die "open error ($csvin): $!\n";
    open $OUT, '>', $output or die "open error ($output): $!\n";

    while ( my $line = <$CSV> ) {
        chomp $line;
        if ( $line =~ /^(?:[^,]*,){30}"([^"]*)/ ) {
            my $reten = $IDrtPairs->get($1);
            print $OUT "$line,$reten\n";
        }
        else {
            print $OUT "$line\n";
        }
    }
    close $CSV;
    close $OUT;

    Perl is fun, Mario.