I had a site that was running way too slow. I determined that the problem was that the same search-and-replace regexes were being compiled over and over again. I was doing stream parsing of HTML with HTML::PARSE, so I could not change the program flow. I decided to compile the regexes once and reuse them, and found out that it was not as simple as it seemed. See the code below.
#!/usr/bin/perl -w
use strict;

my $lhs = "abc";
my ($string1, $string2, $string3, $string4);
$string1 = "qrsabcwty";
$string2 = "qrsabcwty";
$string3 = "qrsabcwty";
$string4 = "qrsabcwty";
my %lhsCompiled;

# Let's say the regex we want is
$string1 =~ s/($lhs)/XXX$1/;

# Then I compile it.
my $rhsCompiled = qr{XXX$1};
$lhsCompiled{$lhs} = qr{($lhs)};

# This regex will not work
$string2 =~ s/$lhsCompiled{$lhs}/$rhsCompiled/;
# The left hand side works, but the right hand side comes
# out as (?-xism:XXXabc)
# it returns qrs(?-xism:XXXabc)wty

# Instead I make the right hand side into a code block
# using the e modifier, which identifies the right hand side
# as a code block or a subroutine call.
sub lhsSub { return "XXX".$1 }
$string3 =~ s/$lhsCompiled{$lhs}/lhsSub/e;
$string4 =~ s/$lhsCompiled{$lhs}/"XXX".$1/e;
# I suspect using a subroutine is faster. It will be compiled,
# I am not certain if the code will be compiled

print "string1 $string1\n";
print "string2 $string2\n";
print "string3 $string3\n";
print "string4 $string4\n";

# The results are
# string1 qrsXXXabcwty
# string2 qrs(?-xism:XXXabc)wty
# string3 qrsXXXabcwty
# string4 qrsXXXabcwty
You can see the code at work at Truespel Converter

Re: Compiling Search/Replace Regexs
by diotalevi (Canon) on Sep 02, 2005 at 22:30 UTC

    You are mistaken about when things are compiled. Please see /o is dead, long live qr//! for an in-depth look at when regular expressions are compiled. That description completely ignores the right hand side (RHS) of a search-replace operation, for good reason: the RHS is already compiled during the normal BEGIN-time compile phase. A static, non-interpolating RHS is a constant for perl. That's the fastest. A static interpolating RHS (use of $1 will trigger this) is a plain concatenation. This is also as optimal as it can get. Pushing the concatenation into a subroutine means you still do the same concatenation, but you also incur the cost of a function call. Obviously, opting to do non-productive work is going to slow things down.

    Please trust that your RHS is already compiled appropriately when you write things like s/.../RHS $1/. It is not recompiled whenever the left hand side (LHS) changes.

    The document I linked to informs you that the LHS is recompiled anytime the s/// or m// was not given a complete, compiled regexp and the stringification of the regexp is not identical to the last time the regexp was compiled. In practice, this means that if you use the same pattern more than once in a row, it is compiled only once. Depending on whether you are passing in a string or a compiled regexp object, you may incur the cost of doing a string equality check. This is the same code that eq uses.

    Please note that it is entirely inappropriate to consider the use of qr// for your RHS. This should be obvious to you by now, but if not, note it now. You only ever use the result of a qr// on the LHS.

    If you have multiple regexes you wish to cache and you aren't going to name each one individually, use a hash. Here, you store a compiled regex in a hash keyed to whatever the text of the regex is. Your replacement uses the text as a lookup to get the compiled version.

    $cache{$regex} ||= qr/$regex/;
    $string =~ s/$cache{$regex}/XXX$1/

    If you have a single regex you'd like to name and cache, this is even nicer. You can avoid a hash lookup.

    $something_regex = qr/$regex/;
    $string =~ s/$something_regex/XXX$1/
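
    As a rough, self-contained sketch of the hash-cache idea: the helper name replace_cached and the sample strings are made up for illustration, and the pattern is wrapped in a capture group so that $1 is set for the replacement.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical cache: pattern text => compiled regex object.
my %cache;

# replace_cached() is an illustrative helper, not part of the original post.
# It wraps the pattern in parentheses so that $1 holds the matched text.
sub replace_cached {
    my ($string, $pattern) = @_;

    # qr// runs only the first time a given pattern text is seen.
    my $re = $cache{$pattern} ||= qr/($pattern)/;

    # The RHS is ordinary interpolation; it needs no qr// and no /e.
    $string =~ s/$re/XXX$1/;
    return $string;
}

my @out = map { replace_cached($_, 'abc') } ('qrsabcwty', 'abcabc', 'nomatch');
print "$_\n" for @out;
```

    All three calls share one compiled regex, because the cache is keyed on the pattern text.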
Re: Compiling Search/Replace Regexs
by fergal (Chaplain) on Sep 02, 2005 at 23:08 UTC
    The right hand side is not a regex so you shouldn't try to compile it. In the example above you could just do
    s/$lhsCompiled{$lhs}/XXX$1/
    If you want to vary the right hand side depending on $lhs then you could do
    s/$lhsCompiled{$lhs}/&{$rhsSubs{$lhs}}($1)/e
    so that each left hand side can have a corresponding right hand side.
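
    A minimal sketch of that pairing, assuming two made-up patterns and replacement subs (the names convert, %lhsCompiled and %rhsSubs here are only for illustration). The /e modifier is what makes the RHS run as code, so the sub actually gets called:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative data: each pattern text maps to a compiled regex ...
my %lhsCompiled = (
    abc => qr/(abc)/,
    wty => qr/(wty)/,
);

# ... and to a replacement sub that receives the captured text.
my %rhsSubs = (
    abc => sub { "XXX$_[0]" },
    wty => sub { uc $_[0] },
);

# convert() is a hypothetical helper that applies every pair in turn.
sub convert {
    my ($string) = @_;
    for my $lhs (sort keys %lhsCompiled) {
        # /e evaluates the RHS as Perl code, so the sub is called with $1.
        $string =~ s/$lhsCompiled{$lhs}/$rhsSubs{$lhs}->($1)/e;
    }
    return $string;
}

my $result = convert('qrsabcwty');
print "$result\n";
```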
Re: Compiling Search/Replace Regexs
by QM (Parson) on Sep 02, 2005 at 21:08 UTC
    If you thought the regex was causing the problem, a subroutine may or may not be faster.

    You need to Benchmark your code with data representative of your application. If you post a sample of what you're really trying to do, some nice monk here might collect all of the ideas into a benchmark and post it here (which you can then run on your target system for comparison).
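
    As a starting point, here is a hedged sketch using the core Benchmark module to compare the three RHS styles from the original post. The input string and the sub name add_prefix are made up; real timings will depend on your data and your perl, so no numbers are claimed here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $re    = qr/(abc)/;
my $input = 'qrsabcwty' x 10;    # made-up sample; substitute representative data

# Hypothetical helper; it reads $1 set by the enclosing s///.
sub add_prefix { return "XXX" . $1 }

# Each variant copies the input, substitutes, and returns the result.
my %variants = (
    interpolate => sub { (my $s = $input) =~ s/$re/XXX$1/g;        $s },
    expr_e      => sub { (my $s = $input) =~ s/$re/"XXX".$1/ge;    $s },
    sub_e       => sub { (my $s = $input) =~ s/$re/add_prefix()/ge; $s },
);

# A negative count tells Benchmark to run each variant for at least
# that many CPU seconds, then print a comparison table.
cmpthese(-1, \%variants);
```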

    BTW, how do you know it was the regex?

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      I knew it was the regex slowing me down: when I forced a single compilation with the /o modifier, my code immediately sped up (with the wrong results). The final version, using the compilation system I showed, cut the wall-clock running time for converting cnn.com from 22 seconds to 7 seconds.