in reply to An efficient way to parallelize loops

Are you sure it's the contents of the loop that are slow, and not the reading of the ZIP file? Try Devel::NYTProf to find the actual bottleneck.

Also there are some things you can improve within the loop before investigating parallelism.

For one, you can look up $categories{$k}->{traces} once, outside the whole loop.

It might also be much faster to join all those regexes together to a single regex and match it once, instead of iterating over the regexes (Don't know if that works in your case).
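A minimal sketch of what joining the regexes could look like, assuming the individual patterns contain no capture groups of their own. The @traces data and the match_index helper are hypothetical stand-ins for the entries in $categories{$k}->{traces}, not code from the original post. Each pattern gets its own capture group, so checking which group is defined tells you which pattern matched:

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the entries in $categories{$k}->{traces}.
my @traces = (
    { regex => 'foo\d+', calc => [ 1, 2, 3 ] },
    { regex => 'bar',    calc => [ 3, 2, 1 ] },
);

# Wrap each pattern in its own capture group and join them into one
# alternation. Assumption: the patterns have no capture groups themselves.
my $alternation = join '|', map { "($_->{regex})" } @traces;
my $combined    = qr/^(?:$alternation)/;

# Return the index of the pattern that matched, or -1 if none did.
sub match_index {
    my ($line) = @_;
    return -1 unless $line =~ $combined;
    for my $group ( 1 .. scalar @traces ) {
        # $-[$group] is defined only if that group took part in the match.
        return $group - 1 if defined $-[$group];
    }
    return -1;
}

print match_index("bar;10;20;30"), "\n";
```

Alternatives are tried in order at the anchored position, so the first pattern that matches wins, just as in a loop over the individual regexes.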

Also you seem to read the whole file into memory first, and then iterate over it - that's rather inefficient. Instead use

OUTER: while(my $line = <GZIP>) { ...

to read it line by line.
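As a self-contained illustration of the line-by-line style, here is a sketch that reads from an in-memory string instead of the real gzip handle. The sample data and the column being summed are made up for the example:

```perl
use strict;
use warnings;

# Made-up sample data standing in for the contents of the real file.
my $data = "a;1;2\nb;3;4\nc;5;6\n";

# Open an in-memory filehandle so the example needs no file on disk.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my $sum = 0;
while ( my $line = <$fh> ) {    # only one line is held in memory at a time
    chomp $line;
    my @fields = split /;/, $line;
    $sum += $fields[1];         # add up the second column
}
close $fh;

print "$sum\n";
```

Only the current line is kept around, so memory use stays flat no matter how large the input is.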

Parallelization is usually a lot of trouble, so try the conventional optimization wisdom first.

(Update: removed one comment that's not applicable; added hint about memory usage.)

Replies are listed 'Best First'.
Re^2: An efficient way to parallelize loops
by JavaFan (Canon) on Jun 01, 2010 at 10:05 UTC
    It might also be much faster to join all those regexes together to a single regex and match it once, instead of iterating over the regexes (Don't know if that works in your case).
    And it may also be a lot slower. Combining the patterns may cause the optimizer to give up much earlier (or not kick in at all). OTOH, the code as given would benefit from a combined pattern in the sense that no recompilation is needed at all (but there are other ways to achieve that). OP should do some benchmarking to see whether combining the patterns is an improvement or not. (If OP needs to know which "branch" of an alternation matched, OP could make use of the (*:NAME) construct, provided OP uses 5.10 or later.)
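A rough sketch of such a benchmark, using the core Benchmark module. The patterns and input lines are invented for illustration; real numbers depend entirely on the actual patterns and data, so treat this only as a template. Both variants use qr// so that neither pays for recompilation inside the loop:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Invented patterns and data; substitute the real ones before drawing conclusions.
my @words    = qw(alpha beta gamma delta);
my @patterns = map { qr/^$_/ } @words;    # precompiled individual patterns
my $alt      = join '|', @words;
my $combined = qr/^(?:$alt)/;             # one precompiled alternation
my @lines    = map { ( "alpha;$_", "delta;$_", "omega;$_" ) } 1 .. 1000;

# Variant 1: try each pattern in turn, stop at the first hit.
sub count_loop {
    my $hits = 0;
    LINE: for my $line (@lines) {
        for my $re (@patterns) {
            if ( $line =~ $re ) { $hits++; next LINE; }
        }
    }
    return $hits;
}

# Variant 2: a single match against the combined pattern.
sub count_combined {
    my $hits = 0;
    for my $line (@lines) {
        $hits++ if $line =~ $combined;
    }
    return $hits;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese( -1, { loop => \&count_loop, combined => \&count_combined } );
```

Checking that both subs return the same count before comparing timings guards against benchmarking two things that do not do the same work.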
Re^2: An efficient way to parallelize loops
by Deus Ex (Scribe) on Jun 01, 2010 at 09:25 UTC

    Hi moritz

    First off, thanks for your kind reply.

    Your first point, unfortunately, is not applicable, because depending on which element of @{$categories{$k}->{traces}} matched, the sum below will be applied to a different element of the @lista array.

    I was trying to find other solutions, actually, to read the file. Does the while loop run faster than doing a foreach loop, or putting the whole file into an array, and then reading the array line by line?

    Thanks for the help!

      Your first point, unfortunately, is not applicable, because depending on which element of @{$categories{$k}->{traces}} matched, the sum below will be applied to a different element of the @lista array.

      I don't understand what you're saying. The code you showed us changes neither %categories nor $k. Why should anything change if you do the lookup outside the loop?

      # Look up the traces once, outside the read loop.
      my @a = @{ $categories{$k}->{traces} };
      OUTER: while ( my $line = <GZIP> ) {
          for ( my $i = 0; $i < @a; $i++ ) {
              if ( $line =~ /^($a[$i]->{regex})/ ) {
                  my @lista = split /;/, $line;
                  $A += $lista[ $a[$i]->{calc}[0] ];
                  $B += $lista[ $a[$i]->{calc}[1] ];
                  $C += $lista[ $a[$i]->{calc}[2] ];
                  next OUTER;
              }
          }
      }

      This should do exactly the same as your code, only more efficient.

      The version with while is at least as fast as the version with for, and uses much less memory.

      Still I'd like to emphasize my first point again: Benchmark and profile before starting to optimize (and before even thinking of parallelization).

        Dammit! I misunderstood what you wrote before: I thought you were suggesting to move the whole "if" statement outside of the loop, not the lookup of the values used in the statement. Sorry! You were right.

        I'll also do the profiling, to see where the bottlenecks are.

        Thank you very much again!

        Do you think moving the split() out of the inner loop would be useful? I was also thinking of doing it like this:

        open(GZIP, "<:gzip", $path) or die "$!\n";
        my @a = @{ $categories{$k}->{traces} };
        LOOP: while ( my $line = <GZIP> ) {
            my @lista = split /;/, $line;
            my $head  = $lista[0];
            for ( my $i = 0; $i < scalar @a; $i++ ) {
                if ( $head =~ /^($a[$i]->{regex})$/ ) {
                    $A += $lista[ $a[$i]->{calc}[0] ];
                    $B += $lista[ $a[$i]->{calc}[1] ];
                    $C += $lista[ $a[$i]->{calc}[2] ];
                    next LOOP;
                }
            }
        }
        close(GZIP);

        I test only the first element of the @lista array against the pattern, because that is where the text to match lives; there is no need to test the whole line.

Re^2: An efficient way to parallelize loops
by Deus Ex (Scribe) on Jun 01, 2010 at 12:51 UTC

    Unfortunately, the module Devel::NYTProf needs at least perl 5.8.1, while I only have 5.8.0. How can I profile/benchmark with the tools I have?

    Thanks again

        If I could, I would already :)

        Unfortunately, I can't upgrade, because I only develop on that machine and have no administrator rights.

        Thanks though