Re^2: multiple substitution

I answered a similar question recently with a loop:

    $s =~ s/$_/$h{$_}/g for keys %h;

So I wondered how that would compare to your solution of combining the searches into a single regex. I thought your way might win for a few words, but surely with a lot of words the complexity of the regex would slow it down, right?

Well, so much for that theory. The Perl regex engine continues to amaze me. I gave it a pattern combining 676 strings (all two-letter combinations) with pipes like yours, and it blew the for-loop method away (92 times faster). It also beat a regex solution using Regexp::Assemble, but I was using very simple, known search strings, so the hand-made pipe method was safe and simple. With unknown or more complex strings, where it is harder to hand-make a safe and efficient search pattern, I think RA would probably come out on top eventually. Anyway, my test and results:

abaugher@bannor> cat 989705.pl 
#!/usr/bin/env perl
use Modern::Perl;
use Benchmark qw(:all);
use Regexp::Assemble;

my %h = map { $_ => uc } ( 'aa' .. 'zz' );
my $s = `cat bigfile`; # 8MB file

say "Testing with @{[-s 'bigfile']} byte file and @{[ scalar keys %h ]} patterns";

cmpthese( 10, {
        'forloop' => \&forloop,
        'pipes'   => \&pipes,
        'regexpa' => \&regexpa,
});

sub forloop {
        $s =~ s/$_/$h{$_}/g for keys %h;
}

sub pipes {
        my $p = join '|', keys %h;
        $s =~ s/($p)/$h{$1}/g;
}

sub regexpa {
        my $p = Regexp::Assemble->new->add(keys %h)->re;
        $s =~ s/($p)/$h{$1}/g;
}
abaugher@bannor> perl 989705.pl                               
Testing with 8560854 byte file and 676 patterns
              Rate forloop regexpa   pipes
forloop 9.75e-02/s      --    -96%    -99%
regexpa     2.40/s   2364%      --    -74%
pipes       9.08/s   9213%    278%      --
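
For the "unknown or more complex strings" case mentioned above, a minimal sketch of building the pipe pattern more defensively might look like the following. The replacement table and sample string are made up for illustration; the idea is only to escape metacharacters with `quotemeta` and to try longer keys first so that a key which is a prefix of another does not shadow it.

```perl
use strict;
use warnings;

# Hypothetical replacement table (not from the benchmark above): the keys
# contain regex metacharacters, and 'a.b' is a prefix of 'a.bc'.
my %h = (
    'a.b'  => 'X',
    'a.bc' => 'Y',
    'c+'   => 'Z',
);

# Longest keys first, so 'a.bc' is tried before its prefix 'a.b';
# quotemeta escapes metacharacters so every key matches literally.
my $p = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) || $a cmp $b }
        keys %h;
my $re = qr/($p)/;

my $s = 'a.b a.bc axb c+ cc';
(my $out = $s) =~ s/$re/$h{$1}/g;   # substitute on a copy, leave $s alone
print "$out\n";                     # prints "X Y axb Z cc"
```

Without `quotemeta`, the key `a.b` would also match `axb`, and `c+` would match `cc`; without the length sort, `a.bc` could be replaced as `X` followed by a stray `c`, depending on hash order.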

Aaron B.
Available for small or large Perl jobs; see my home node.

Re^3: multiple substitution by AnomalousMonk (Archbishop) on Aug 25, 2012 at 18:11 UTC
The `pipes()` and `regexpa()` functions used in the timing loops above both include generation of the matching regexes in each loop execution. I doubt it adds greatly to the overall execution time, but is it proper to include regex generation in the timing of a substitution operation? On a more critical note, a substitution is done on the `$s` string in each repetition of each timing loop, but will there be anything to be found for substitution after the first pass of whatever timing function happens to be executed first? Are not all subsequent passes in all functions just comparing the time it takes for a regex to find no match in a string? (Maybe take the 8MB file content and `x` it into three identical 200 - 500MB strings and do just one comparison pass of substitutions on each string.)
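
One way to address both concerns, sketched here under some assumptions, is to build the pattern once outside the timed subs and have each sub work on a fresh copy of the original string. This is not the suggestion above taken literally (it copies per call rather than preparing several huge strings), and the synthetic `$orig` text is a stand-in for the 8MB file so the sketch stays self-contained. Regexp::Assemble is left out only to keep the example core-only.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my %h = map { $_ => uc } ( 'aa' .. 'zz' );

# Synthetic stand-in for the 8MB file (an assumption for illustration).
my $orig = join ' ', ('lorem ipsum dolor sit amet') x 2_000;

# Build and compile the alternation once, outside the timed code.
my $p  = join '|', keys %h;
my $re = qr/($p)/;

sub forloop {
    my $s = $orig;                     # fresh copy, so there is always work to do
    $s =~ s/$_/$h{$_}/g for keys %h;
    return $s;
}

sub pipes {
    my $s = $orig;                     # same copy cost as in forloop
    $s =~ s/$re/$h{$1}/g;
    return $s;
}

# Run each sub for at least one CPU second and print the comparison table.
cmpthese( -1, { forloop => \&forloop, pipes => \&pipes } );
```

The string copy is charged to both subs equally, so it should not bias the comparison, though it does add a constant to each timing. Note that the two subs need not produce identical strings: the for loop applies the keys in hash order, so overlapping matches can resolve differently than in the single left-to-right pass.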
Re^3: multiple substitution by Corion (Patriarch) on Aug 25, 2012 at 16:48 UTC
I only (re)used what the OP already had as a regular expression. But your results mesh well with "When Perl Isn't Quite Fast Enough": the fewer ops you need, and the more you can do within the RE engine, the faster your Perl code is.