in reply to tight loop regex optimization
Skip to the bottom first!
There are a few places in the code where small savings can be achieved essentially for free:
But even if you could make this level of savings on every single line in the program, you'd still save maybe an hour at most.
Looking at a few of the individual REs, nothing leaps off the page as being particularly extravagant. You should heed the annotation at the top of the profiling and try to remove usage of $&. This has been known to effect a substantial time saving.
The only place affected is this sub:
    sub java_clean {
        my $contents = $_[0];

        while ($contents =~ s/(\{[^\{]*)\{([^\{\}]*)\}/
                              $1."\05".&wash($2)/ges) {}

        $contents =~ s/\05/\{\}/gs;

        # Remove imports
        ##$contents =~ s/^\s*import.*;/&wash($&)/gem;
        $contents =~ s/(^\s*import.*;)/&wash($1)/gem;

        # Remove packages
        ##$contents =~ s/^\s*package.*;/&wash($&)/gem;
        $contents =~ s/(^\s*package.*;)/&wash($1)/gem;

        return $contents;
    }
The uncommented replacements should have the same effect (untested), and the change could have a substantial effect on the overall performance of a script dominated by regex manipulations.
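Since the replacements are untested, a quick check along these lines could show whether the capture-group form and the `$&` form agree. This is only a sketch: the sample input is made up, only the import substitution is compared, and `wash` is copied from further down this post.

```perl
use strict;
use warnings;

# wash() as in the script: returns one newline for each newline in its argument.
sub wash { return "\n" x ( $_[0] =~ tr/\n// ); }

# Made-up sample input (an assumption); the real Java sources will differ.
my $sample = "package foo.bar;\n\nimport java.util.List;\n  import java.io.*;\nclass A {}\n";

# Original form, using $&.
my $old = $sample;
$old =~ s/^\s*import.*;/wash($&)/gem;

# Replacement form, capturing the whole match in $1 instead.
my $new = $sample;
$new =~ s/(^\s*import.*;)/wash($1)/gem;

print $old eq $new ? "same result\n" : "results differ\n";
```

Running the same comparison over a handful of the real input files would be a more convincing test than this single string.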
While you're at it, you can also add a few micro-optimisations to subs that are called millions of times, like:
    sub wash {
        ##### my $towash = $_[0];
        return ( "\n" x ( $_[0] =~ tr/\n// ) );
    }
which will save the 7 seconds spent copying the input parameter. But given that the overall runtime is 7 minutes, that's not going to have a big effect. The only way you're likely to get substantial savings from within the script is to try optimising the algorithms used -- which amounts to tuning all of the individual regexes, and the heuristics they represent -- and that comes with an enormous risk of breaking the logic completely and would require extensive and detailed testing.
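If you want to see how much the parameter copy costs on your own machine, something like the following could be used. It is a rough sketch using the core Benchmark module; the names `wash_copy` and `wash_alias` and the test string are invented for the comparison, and the numbers will vary with input size and perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Version that copies its argument first (as in the original script).
sub wash_copy  { my $towash = $_[0]; return "\n" x ( $towash =~ tr/\n// ); }

# Version that counts newlines directly in the aliased argument.
sub wash_alias { return "\n" x ( $_[0] =~ tr/\n// ); }

# Made-up input: 200 short lines standing in for a chunk of source code.
my $text = join '', map { "line $_ of some Java source;\n" } 1 .. 200;

# Both versions must agree before timing means anything.
die "results differ" unless wash_copy($text) eq wash_alias($text);

# Run each version for at least one CPU second and print a comparison table.
cmpthese( -1, {
    copy  => sub { wash_copy($text) },
    alias => sub { wash_alias($text) },
} );
```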
All of that said, if you split the workload across two processors, you're likely to achieve close to a 50% saving. Across four, a 75% saving is theoretically possible. It really doesn't make much sense to spend time looking for savings within the script when, with a little restructuring, it lends itself so readily to being parallelised.
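As a rough illustration of the restructuring, here is a minimal fork-based sketch. It assumes the work divides into independent items (for instance one source file each) and that a real `fork` is available, as on Unix-like systems; `process_item`, the item list and the worker count are all hypothetical stand-ins, not parts of your script.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real per-item work (e.g. cleaning one file).
sub process_item { my ($item) = @_; return length $item; }

my @items   = map { "item $_" } 1 .. 20;    # made-up work list
my $workers = 4;                             # assumed number of processors

my @readers;
for my $w ( 0 .. $workers - 1 ) {
    pipe( my $reader, my $writer ) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ( $pid == 0 ) {                       # child process
        close $reader;
        # Each worker takes every $workers-th item, starting at its own index.
        for ( my $i = $w; $i < @items; $i += $workers ) {
            print {$writer} "$i\t", process_item( $items[$i] ), "\n";
        }
        close $writer;
        exit 0;
    }

    close $writer;                           # parent keeps only the read end
    push @readers, $reader;
}

# Parent: collect "index<TAB>result" lines from each worker in turn.
my @results;
for my $reader (@readers) {
    while ( my $line = <$reader> ) {
        chomp $line;
        my ( $i, $value ) = split /\t/, $line;
        $results[$i] = $value;
    }
    close $reader;
}
1 while wait() != -1;                        # reap all children

print scalar(@results), " items processed\n";
```

The interleaved split keeps the workers roughly balanced when item sizes vary; if results are written straight to per-item output files, the pipes can be dropped entirely.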