G'day TravelAddict,
"Any way I could improve the performance?"
There's a number of things you could do
(I acknowledge that some of these have already been mentioned in one form or another by others).
With the regex:
-
My general rule-of-thumb is to not give the regex engine any more work than is absolutely necessary.
-
I don't know what '\/?' is for.
Unless there's something about your data which you haven't shown,
or maybe I missed it, I'd remove it.
-
If you have some insider knowledge about the frequency of the tags, use that order for the alternation;
otherwise, I'd suggest using the shorter tag names before the long ones (less characters to check first).
-
You've used a lot of unnecessary modifiers:
you haven't used '.' in the regex, so the 's' modifier will have no effect;
you haven't used either of the '^' or '$' assertions, so the 'm' modifier will have no effect;
all characters are lowercase, so the 'i' modifier will have no effect.
With the return value:
-
Whatever the last statement in a subroutine evaluates to, will be returned by default
(no explicit 'return' is needed).
-
'("$cString1" eq "$cString2")' returns a boolean.
No need for the ternary operator or the two 'return' statements.
-
'("$cString1" eq "$cString2")' can be reduced to '$cString1 eq $cString2'
(removing two stringification operations).
I played around with various ways to speed up your routine.
If you have Perl 5.14 or higher, here's a drop-in replacement:
sub compare {
state $stripped;
($stripped->{$_[0]} //= $_[0] =~ s/<(?:i|ph|bpt|ept) [^>]*?>//gr)
eq
($stripped->{$_[1]} //= $_[1] =~ s/<(?:i|ph|bpt|ept) [^>]*?>//gr);
}
[If you have an earlier version of Perl,
and don't know how to wind back this code, just ask.]
I ran some benchmarks.
I've no idea what your real data looks like or how big it is (neither record length nor number of records):
the results I got should be viewed tenatively; you should run your own benchmarks with representative data.
This code ran about 4 times faster than your posted 'compare()'
(I took the fact that $stripped cached results across runs and skewed the results into consideration
— actual results I got indicated more than 5 times faster).
Here's the benchmark code (pm_1217567_cmp_bench.pl):
!/usr/bin/env perl
use 5.014;
use warnings;
use Benchmark 'cmpthese';
my @data = (
'XX', 'XY', 'X<ph />X', 'X<ph />Y',
'X<bpt x="y">X', 'X<bpt x="y">Y',
'X<ept x="z">X', 'X<ept x="z">Y',
'X<i />X', 'X<i />Y',
'X<bpt x="y">XX<i />XX<i />X',
'X<bpt x="y">XX<i />XX<i />Y',
);
if (@ARGV && $ARGV[0] eq 'bench') {
cmpthese 0 => {
orig => sub { run_comp(\&compare, 1) },
kens => sub { run_comp(\&kens_cmp, 1) },
};
}
else {
say '*** Original ***';
say '-' x 60;
sub { run_comp(\&compare) }->();
say '*** Ken\'s ***';
say '-' x 60;
sub { run_comp(\&kens_cmp) }->();
}
sub run_comp {
my ($cmp_func, $bench) = @_;
for my $d1 (@data) {
for my $d2 (@data) {
if (defined $bench) {
$cmp_func->($d1, $d2);
}
else {
say "'$d1' and '$d2': ",
$cmp_func->($d1, $d2) ? "SAME" : "DIFFER";
}
}
}
}
sub compare {
# We ignore the standalone, opening and closing tags <ph />, <bpt x=
+"y">, <ept x="z">, <i />
my $cString1 = $_[0];
my $cString2 = $_[1];
$cString1 =~ s/\/?<(bpt|ept|ph|i)[^<>]*>//gsmi;
$cString2 =~ s/\/?<(bpt|ept|ph|i)[^<>]*>//gsmi;
("$cString1" eq "$cString2") ? return 1 : return 0;
}
sub kens_cmp {
state $stripped;
($stripped->{$_[0]} //= $_[0] =~ s/<(?:i|ph|bpt|ept) [^>]*?>//gr)
eq
($stripped->{$_[1]} //= $_[1] =~ s/<(?:i|ph|bpt|ept) [^>]*?>//gr);
}
I've set that up so you can run the benchmark like this:
$ ./pm_1217567_cmp_bench.pl bench
Rate orig kens
orig 2035/s -- -81%
kens 10576/s 420% --
Or you can run it without benchmarking, to capture and check the results.
There's a lot of output, you'll probably want to pipe it into your favourite pager or redirect to a file.
I did this:
$ ./pm_1217567_cmp_bench.pl > ./pm_1217567_cmp_bench_both.out
$ cp ./pm_1217567_cmp_bench_both.out ./pm_1217567_cmp_bench_orig.out
$ cp ./pm_1217567_cmp_bench_both.out ./pm_1217567_cmp_bench_kens.out
$ vi ./pm_1217567_cmp_bench_orig.out # /Ken<NL>dGZZ
$ vi ./pm_1217567_cmp_bench_kens.out # /Ken<NL>kmm1Gd'mZZ
$ diff ./pm_1217567_cmp_bench_orig.out ./pm_1217567_cmp_bench_kens.out
1c1
< *** Original ***
---
> *** Ken's ***
$
I don't know what your level of Perl knowledge is;
feel free to ask if there's anything you don't follow.
|