G'day TravelAddict,
"Any way I could improve the performance?"
There's a number of things you could do (I acknowledge that some of these have already been mentioned in one form or another by others).
With the regex:
With the return value:
I played around with various ways to speed up your routine. If you have Perl 5.14 or higher, here's a drop-in replacement:
sub compare { state $stripped; ($stripped->{$_[0]} //= $_[0] =~ s/<(?:i|ph|bpt|ept) [^>]*?>//gr) eq ($stripped->{$_[1]} //= $_[1] =~ s/<(?:i|ph|bpt|ept) [^>]*?>//gr); }
[If you have an earlier version of Perl, and don't know how to wind back this code, just ask.]
I ran some benchmarks. I've no idea what your real data looks like or how big it is (neither record length nor number of records): the results I got should be viewed tenatively; you should run your own benchmarks with representative data. This code ran about 4 times faster than your posted 'compare()' (I took the fact that $stripped cached results across runs and skewed the results into consideration — actual results I got indicated more than 5 times faster).
Here's the benchmark code (pm_1217567_cmp_bench.pl):
!/usr/bin/env perl use 5.014; use warnings; use Benchmark 'cmpthese'; my @data = ( 'XX', 'XY', 'X<ph />X', 'X<ph />Y', 'X<bpt x="y">X', 'X<bpt x="y">Y', 'X<ept x="z">X', 'X<ept x="z">Y', 'X<i />X', 'X<i />Y', 'X<bpt x="y">XX<i />XX<i />X', 'X<bpt x="y">XX<i />XX<i />Y', ); if (@ARGV && $ARGV[0] eq 'bench') { cmpthese 0 => { orig => sub { run_comp(\&compare, 1) }, kens => sub { run_comp(\&kens_cmp, 1) }, }; } else { say '*** Original ***'; say '-' x 60; sub { run_comp(\&compare) }->(); say '*** Ken\'s ***'; say '-' x 60; sub { run_comp(\&kens_cmp) }->(); } sub run_comp { my ($cmp_func, $bench) = @_; for my $d1 (@data) { for my $d2 (@data) { if (defined $bench) { $cmp_func->($d1, $d2); } else { say "'$d1' and '$d2': ", $cmp_func->($d1, $d2) ? "SAME" : "DIFFER"; } } } } sub compare { # We ignore the standalone, opening and closing tags <ph />, <bpt x= +"y">, <ept x="z">, <i /> my $cString1 = $_[0]; my $cString2 = $_[1]; $cString1 =~ s/\/?<(bpt|ept|ph|i)[^<>]*>//gsmi; $cString2 =~ s/\/?<(bpt|ept|ph|i)[^<>]*>//gsmi; ("$cString1" eq "$cString2") ? return 1 : return 0; } sub kens_cmp { state $stripped; ($stripped->{$_[0]} //= $_[0] =~ s/<(?:i|ph|bpt|ept) [^>]*?>//gr) eq ($stripped->{$_[1]} //= $_[1] =~ s/<(?:i|ph|bpt|ept) [^>]*?>//gr); }
I've set that up so you can run the benchmark like this:
$ ./pm_1217567_cmp_bench.pl bench Rate orig kens orig 2035/s -- -81% kens 10576/s 420% --
Or you can run it without benchmarking, to capture and check the results. There's a lot of output, you'll probably want to pipe it into your favourite pager or redirect to a file. I did this:
$ ./pm_1217567_cmp_bench.pl > ./pm_1217567_cmp_bench_both.out $ cp ./pm_1217567_cmp_bench_both.out ./pm_1217567_cmp_bench_orig.out $ cp ./pm_1217567_cmp_bench_both.out ./pm_1217567_cmp_bench_kens.out $ vi ./pm_1217567_cmp_bench_orig.out # /Ken<NL>dGZZ $ vi ./pm_1217567_cmp_bench_kens.out # /Ken<NL>kmm1Gd'mZZ $ diff ./pm_1217567_cmp_bench_orig.out ./pm_1217567_cmp_bench_kens.out 1c1 < *** Original *** --- > *** Ken's *** $
I don't know what your level of Perl knowledge is; feel free to ask if there's anything you don't follow.
— Ken
In reply to Re: Ignoring patterns when comparing strings
by kcott
in thread Ignoring patterns when comparing strings
by TravelAddict
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |