Re: An efficient way to parallelize loops
by moritz (Cardinal) on Jun 01, 2010 at 09:04 UTC
Are you sure it's the contents of the loop that are slow, and not the reading of the ZIP file? Try Devel::NYTProf to find the actual bottleneck.
Also there are some things you can improve within the loop before investigating parallelism.
For one, you can look up $categories{$k}->{traces} once, outside the whole loop.
It might also be much faster to join all those regexes together to a single regex and match it once, instead of iterating over the regexes (Don't know if that works in your case).
Also you seem to read the whole file into memory first, and then iterate over it - that's rather inefficient. Instead use
OUTER: while(my $line = <GZIP>) {
...
to read it line by line.
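A minimal sketch of the difference, assuming an in-memory filehandle as a stand-in for the real GZIP handle. The sample data and the names $data, @all, $slurp_count and $stream_count are made up for the illustration and are not from the original code:

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the decompressed file.
my $data = join '', map { "REC$_;1;2;3\n" } 1 .. 5;

# Slurping: every line is held in memory at the same time.
open my $slurp_fh, '<', \$data or die "cannot open in-memory handle: $!";
my @all = <$slurp_fh>;
close $slurp_fh;
my $slurp_count = scalar @all;

# Streaming: only the current line is held in memory.
open my $stream_fh, '<', \$data or die "cannot open in-memory handle: $!";
my $stream_count = 0;
while ( my $line = <$stream_fh> ) {
    $stream_count++;
}
close $stream_fh;

print "slurped $slurp_count lines, streamed $stream_count lines\n";
```

Both variants see the same lines. Note that foreach over <FH> evaluates the handle in list context, so it also builds the full list before the first iteration.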
Parallelization is usually a lot of trouble, so try the conventional optimization wisdom first.
(Update: removed one comment that's not applicable; added hint about memory usage.)
It might also be much faster to join all those regexes together to a single regex and match it once, instead of iterating over the regexes (Don't know if that works in your case).
And it also may be a lot slower. Combining the patterns may cause the optimizer to give up much earlier (or not kick in at all). OTOH, the code as given would benefit from a combined pattern in the sense that no recompilation is needed at all (but there are other ways to achieve that). OP should do some benchmarking to see whether combining the patterns is an improvement or not. (If OP needs to know which "branch" of an alternation matched, the OP could make use of the (*:NAME) construct - provided the OP uses 5.10 or later).
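A rough sketch of the branch-labelling idea, assuming perl 5.10 or later. It uses (*MARK:NAME), the long form of (*:NAME), and reads the result from $REGMARK. The mark names and the helper which_branch are invented for the example; the record prefixes are borrowed from later in the thread:

```perl
use strict;
use warnings;
use 5.010;    # (*MARK:NAME) needs perl 5.10 or later

# $REGMARK is an ordinary package variable that the regex engine fills in.
our $REGMARK;

# Each alternative carries its own (hypothetical) mark name.
my $combined = qr/^(?:MDR(*MARK:mdr)|TCBMDR(*MARK:tcbmdr)|INSS7MDR(*MARK:inss7mdr))/;

# Return the mark name of the branch that matched, or undef on no match.
sub which_branch {
    my ($line) = @_;
    return $line =~ $combined ? $REGMARK : undef;
}

print which_branch("TCBMDR;1;2;3"), "\n";
```

Whether the marks cost more than they save is something only a benchmark on the real patterns can tell.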
Hi moritz
First off, thanks for your kind reply.
Your first point, unfortunately, is not applicable: depending on which element of @{$categories{$k}->{traces}} matched, the sum below is applied to a different element of the @lista array.
I was trying to find other solutions, actually, to read the file. Does the while loop run faster than doing a foreach loop, or putting the whole file into an array, and then reading the array line by line?
Thanks for the help!
my @a = @{ $categories{$k}->{traces} };    # look up the trace list once
OUTER: while ( my $line = <GZIP> ) {
    for ( my $i = 0; $i < @a; $i++ ) {
        if ( $line =~ /^($a[$i]->{regex})/ ) {
            my @lista = split /;/, $line;
            $A += $lista[ $a[$i]->{calc}[0] ];
            $B += $lista[ $a[$i]->{calc}[1] ];
            $C += $lista[ $a[$i]->{calc}[2] ];
            next OUTER;    # the first matching pattern wins
        }
    }
}
This should do exactly the same as your code, only more efficiently.
The version with while is at least as fast as the version with for, and uses much less memory.
Still I'd like to emphasize my first point again: Benchmark and profile before starting to optimize (and before even thinking of parallelization).
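As an illustration of what such a benchmark might look like, here is a small sketch using the core Benchmark module. The record names come from elsewhere in the thread; the generated lines and the sub names count_looped and count_combined are made up, so the timings say nothing about the real data:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical input: 1000 lines, each starting with one of the record names.
my @names = qw(MDR TCBMDR INSS7MDR TCBINSS7MDR);
my @lines = map { $names[ $_ % @names ] . ";1;2;3" } 1 .. 1000;

# One precompiled pattern per name, and one combined alternation.
my @separate    = map { qr/^$_/ } @names;
my $alternation = join '|', @names;
my $combined    = qr/^(?:$alternation)/;

# Try the patterns one by one and stop at the first hit.
sub count_looped {
    my $hits = 0;
    LINE: for my $line (@lines) {
        for my $re (@separate) {
            if ( $line =~ $re ) { $hits++; next LINE }
        }
    }
    return $hits;
}

# Match the single combined pattern once per line.
sub count_combined {
    my $hits = 0;
    for my $line (@lines) {
        $hits++ if $line =~ $combined;
    }
    return $hits;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese( -1, { looped => \&count_looped, combined => \&count_combined } );
```

The same harness can compare any other pair of variants, as long as both are fed a representative sample of the real file.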
Re: An efficient way to parallelize loops
by JavaFan (Canon) on Jun 01, 2010 at 09:49 UTC
I don't know how complex the $categories{$k}->{traces}[$i]->{regex} patterns are, but because you loop over a set of patterns for each line, you do a regexp compile on every pass through the inner loop. You may want to precompile the patterns (note that even if $categories{$k}->{traces}[$i]->{regex} is a qr// construct, using it inside a larger pattern, with the anchor and the parens, means that it gets stringified and recompiled each time). Alternatively, you may swap the inner and outer loop (that is, for each pattern, loop over the file), but you'll have to do some benchmarking: whether or not it's faster depends on all kinds of factors.
And as others have already pointed out: use a while loop when iterating over the handle instead of a foreach.
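A sketch of the precompilation idea, under the assumption that the record names are literal strings (hence the quotemeta, which is not in the original code). The @traces data and the helper find_trace are hypothetical stand-ins for the real $categories structure:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the entries of $categories{$k}->{traces}.
my @traces = (
    { records => [ 'MDR',      'TCBMDR' ],      calc => [ 1, 2, 3 ] },
    { records => [ 'INSS7MDR', 'TCBINSS7MDR' ], calc => [ 3, 2, 1 ] },
);

# Build the anchor and the alternation into the qr// object once, up front.
for my $trace (@traces) {
    my $alternation = join '|', map { quotemeta } @{ $trace->{records} };
    $trace->{regex} = qr/^(?:$alternation)/;
}

# Match against the qr// object on its own, with nothing added around it,
# so the pattern does not have to be stringified and compiled again.
sub find_trace {
    my ($line) = @_;
    for my $i ( 0 .. $#traces ) {
        return $i if $line =~ $traces[$i]{regex};
    }
    return -1;    # no trace matched
}

print find_trace("TCBINSS7MDR;4;5;6"), "\n";
```

The capturing parens from the original match are dropped here because the loop never uses $1; if the matched prefix is needed, they can go inside the qr// as well.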
Hi javaFan
Thanks for your help. The patterns are not complex, but long, since they're made of all the elements of long arrays, separated by "|" (with the intent of matching any one of several alternative patterns).
As you suggested, I already precompile the regex with the qr// operator, which sped up the code.
I'll try to swap the loops, though, to see if there's any improvement, and I'll do some profiling as well.
Many thanks again for your help!
The patterns are not complex, but long, since they're made of all the elements of long arrays, separated by "|" (with the intent of matching any one of several alternative patterns).
Then you'll likely benefit from running perl 5.10 or newer, since it implements a Trie optimization for alternations of literal patterns. If the arrays are really huge, you could increase the value of ${^RE_TRIE_MAXBUF} to make them all fit into the same trie.
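A small sketch of how that could look, assuming perl 5.10 or later and a purely literal alternation. The word list is invented, and the factor of 16 over the documented default of 65536 is an arbitrary value chosen for the illustration, not a recommendation:

```perl
use strict;
use warnings;

# Raise the trie buffer limit before the pattern is compiled.
# 65536 is the documented default; the factor of 16 is arbitrary.
${^RE_TRIE_MAXBUF} = 16 * 65536;

# Hypothetical large list of literal alternatives.
my @words       = map { "REC$_" } 1 .. 5000;
my $alternation = join '|', map { quotemeta } @words;

# The qr// is compiled here, after the limit has been changed.
my $big = qr/^(?:$alternation);/;

print "matched\n" if "REC4242;1;2;3" =~ $big;
```

Whether the larger buffer actually helps depends on the size of the real alternation, so it is worth measuring with and without the assignment.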
Re: An efficient way to parallelize loops
by BrowserUk (Patriarch) on Jun 01, 2010 at 20:33 UTC
- I'm surprised no one has asked you a) how many regexes there are; b) to show a few 'typical examples' of the regexes involved.
Beyond the 5.10 trie optimisation--which also has some limitations--there are other ways of optimising the use of multiple regexes against single buffers, but they do tend to vary with the nature of the regexes involved.
- On the basis of what you've shown so far, it looks like this task might be effectively parallelised using threads.
However, your need to stick with 5.8.0--a time when threads were still quite flakey--means I would be reluctant to suggest that solution unless you can upgrade to at least 5.8.5--though 5.8.9 or 5.10+ would be far better.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
...
records => [ 'MDR', 'TCBMDR', 'INSS7MDR', 'TCBINSS7MDR' ],
...
for ( my $i = 0; $i < scalar @{ $categories{$k}->{traces} }; $i++ ) {
    my $TestReStr = join( "|", map { "${_}" } @{ $categories{$k}->{traces}[$i]->{records} } );
    $categories{$k}->{traces}[$i]->{regex} = qr/$TestReStr/;
}
Note that there are many 'records' keys.
I would really like to upgrade to a higher version than 5.8.0, but it's not really possible, because the sysadmins don't do that on these machines.
Thanks for your help though.
Note that there are many 'records' keys.
Sorry, but "many" is not a number. 4? 40? 4e40?
I would really like to upgrade to a higher version than 5.8.0, but it's not really possible, because the sysadmins don't do that on these machines.
If you have your own personal machine, then installing Perl locally is very easy to do.