
Skip to the bottom first!

There are a few places in the code where small savings can be achieved essentially for free:

1 while $re; runs ~10% faster than while( $re ) {}.
$re && last; runs a few percent faster than $re && do{ last };

But even if you could make this level of savings on every single line in the program, you'd still save maybe an hour at most.
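If you want to check the size of that saving on your own perl, the core Benchmark module makes the comparison easy. This is only a sketch: the space-collapsing substitution and the sample string are invented stand-ins for the real regexes, and the percentages will vary with perl version and input.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical workload: collapse runs of spaces two at a time, so the
# substitution has to be repeated until it no longer matches.
sub collapse_modifier {
    my $s = shift;
    1 while $s =~ s/  / /;          # statement-modifier form
    return $s;
}

sub collapse_block {
    my $s = shift;
    while ( $s =~ s/  / / ) { }     # empty-block form
    return $s;
}

# Made-up sample input, not taken from the original script.
my $sample = join 'x', ( '    ' ) x 50;

# Run each variant for at least one CPU second and print a comparison table.
cmpthese( -1, {
    modifier => sub { collapse_modifier($sample) },
    block    => sub { collapse_block($sample) },
} );
```

Both subs return the same string; only the loop syntax differs, so any gap in the table comes from the loop form alone.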

Looking at a few of the individual REs, nothing leaps off the page as being particularly extravagant. You should heed the annotation at the top of the profiling and try to remove usage of $&. This has been known to effect a substantial time saving.

The only place affected is this sub:

sub java_clean {
    my $contents = $_[0];
    while ($contents =~ s/(\{[^\{]*)\{([^\{\}]*)\}/
       $1."\05".&wash($2)/ges) {}
    $contents =~ s/\05/\{\}/gs;

    # Remove imports
    ##$contents =~ s/^\s*import.*;/&wash($&)/gem;
    $contents =~ s/(^\s*import.*;)/&wash($1)/gem;

    # Remove packages
    ##$contents =~ s/^\s*package.*;/&wash($&)/gem;
    $contents =~ s/(^\s*package.*;)/&wash($1)/gem;

    return $contents;
}

The uncommented replacements should have the same effect (untested), and the change could have a substantial effect on the overall performance of a script dominated by regex manipulations.
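Since the replacements are untested, a quick equivalence check on a small sample is cheap insurance. A minimal sketch, assuming the trimmed wash from this post and a made-up Java fragment as input:

```perl
use strict;
use warnings;

# Replace a chunk of text with just its newlines (as in the post).
sub wash { return "\n" x ( $_[0] =~ tr/\n// ) }

# Made-up sample input, not taken from the original data.
my $src = "package foo.bar;\n"
        . "import java.util.List;\n"
        . "  import java.io.*;\n"
        . "class A {}\n";

# Original form, using $& (the whole match).
( my $with_match = $src ) =~ s/^\s*import.*;/wash($&)/gem;

# Replacement form, capturing the whole match as $1 instead.
( my $with_group = $src ) =~ s/(^\s*import.*;)/wash($1)/gem;

print $with_match eq $with_group ? "same\n" : "different\n";
```

As far as I know, perl 5.20 and later largely removed the $& penalty through copy-on-write, so the gain from this change applies mainly to older perls.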

While you're at it, you can also add a few micro-optimisations to subs that are called millions of times, like:

sub wash {
    ##### my $towash = $_[0];
    return ( "\n" x ( $_[0] =~ tr/\n// ) );
}

which will save the 7 seconds spent copying the input parameter. But given that the overall runtime is 7 minutes, that's not going to have a big effect. The only way you're likely to get substantial savings from within the script is to try optimising the algorithms used -- which amounts to tuning all of the individual regexes, and the heuristics they represent -- and that comes with an enormous risk of breaking the logic completely, and would require extensive and detailed testing.
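The trimmed wash is safe because tr/// with an empty replacement list and no /d only counts the matching characters; it does not modify its target, so working directly on $_[0] (an alias to the caller's variable) leaves the caller's string alone. A small sketch with a made-up input:

```perl
use strict;
use warnings;

# Count newlines straight from the aliased argument; no copy is made.
sub wash { return "\n" x ( $_[0] =~ tr/\n// ) }

# Made-up sample input.
my $block  = "int x = 1;\nint y = 2;\nreturn x + y;\n";
my $before = $block;

my $washed = wash($block);

# $washed holds one newline per line of input, so line numbers stay aligned.
print length($washed), " newlines kept\n";
print "caller's string untouched\n" if $block eq $before;
```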

Bottom line

All of that said, if you split the workload across two processors, you're likely to achieve close to a 50% saving. Across four, a 75% saving is theoretically possible. It really doesn't make much sense to spend time looking for savings within the script when, with a little restructuring, it lends itself so readily to being parallelised.
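One way the restructuring might look is sketched below, assuming a platform with a native fork and a workload that splits cleanly per file. Everything here is hypothetical: process_item stands in for the real per-file work (for example a call to java_clean), and the file names and output layout are invented. Threads would be the other obvious route.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical stand-in for the real per-file work.
sub process_item {
    my ($item) = @_;
    return uc $item;
}

# Deal the items round-robin into $n roughly equal shares.
sub partition {
    my ( $n, @items ) = @_;
    my @shares = map { [] } 1 .. $n;
    push @{ $shares[ $_ % $n ] }, $items[$_] for 0 .. $#items;
    return @shares;
}

# Fork one child per share; each child writes its results to its own file.
sub run_parallel {
    my ( $n, $outdir, @items ) = @_;
    my @shares = partition( $n, @items );
    my @pids;
    for my $w ( 0 .. $#shares ) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ( $pid == 0 ) {    # child: do this share, then exit
            open my $out, '>', "$outdir/worker$w.out" or die "open: $!";
            print {$out} process_item($_), "\n" for @{ $shares[$w] };
            close $out or die "close: $!";
            exit 0;
        }
        push @pids, $pid;     # parent: remember the child
    }
    waitpid $_, 0 for @pids;  # wait for every worker to finish
    return;
}

# Made-up work list split across two workers.
my $outdir = tempdir( CLEANUP => 1 );
run_parallel( 2, $outdir, qw(a.java b.java c.java d.java) );
```

Round-robin dealing keeps the shares balanced only if the files are of similar size; with very uneven inputs, handing out work from a shared queue would likely get closer to the theoretical saving.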

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re: tight loop regex optimization by BrowserUk
in thread tight loop regex optimization by superawesome
