TravelAddict has asked for the wisdom of the Perl Monks concerning the following question:

Hello there, I have to compare HTML strings, and I need to ignore the presence of certain tags so that if two strings differ only because of one of those tags, the strings are considered as identical. I'm actually successfully doing this right now, but I have a serious performance issue. The script running on a medium-sized file takes 4 hours to complete, and this is not acceptable. To do this right now, I have created a sub-routine returning TRUE (if the strings are identical), or FALSE if the strings are different. This will ignore the tags that we not need to consider:
sub compare { # We ignore the standalone, opening and closing tags <ph />, <bpt x= +"y">, <ept x="z">, <i /> my $cString1 = $_[0]; my $cString2 = $_[1]; $cString1 =~ s/\/?<(bpt|ept|ph|i)[^<>]*>//gsmi; $cString2 =~ s/\/?<(bpt|ept|ph|i)[^<>]*>//gsmi; ("$cString1" eq "$cString2") ? return 1 : return 0; }
I use this sub-routine later in my code:
if (compare($unit{$x}{string1}, $string2)) { # Do some stuff }
As we can see, I actually do the substitution on the strings before comparing them, and I suspect that's very heavy, so I would want to simply ignore the patterns instead of removing them. Anyway, I do not want to modify any of the strings, this is for comparison only. Any way I could improve the performance? Thanks in advance! TA

Replies are listed 'Best First'.
Re: Ignoring patterns when comparing strings
by Corion (Patriarch) on Jun 28, 2018 at 12:14 UTC

    Without looking into it any deeper, a simple approach is to stop recalculating the "cleaned" version of a string for each comparison. For that, you will need to move the cleaning out of the function compare and up in to the loop calling compare:

    for my $string1 (@strings_left) { my $string1_clean = clean( $string1 ); for my $string2 (@strings_right) { if( compare($string1_clean, clean( $string2 )) { ... } }; };

    If you have some more memory, you can Memoize the cleanup of the string. This would speed up cleaning up strings a bit more.

    But maybe you can save more comparison time by first sorting all your strings into buckets based on the first (few) characters of the string. There is no way that a string starting with "A" will be equal to a string starting with "B". That could cut down on the total number of comparisons made.

      Thanks, Corion. Your reply made me realize that I'm actually doing the substitutions multiple times on the same strings. If I have a list of 10 strings, I compare the first one with the second one, then the third, etc. until the end, then repeat starting with the second string and compare it to the third, and so on. So instead of doing the substitution multiple times, I'll do it once, store the results in a new variable used for comparison only, and I'll use the real value when it's time to write something. Brilliant! I could also probably improve the algorithm itself, but if I stop doing the substitution multiple times on the same string, I should get an acceptable execution time. And thanks for the tip about Memoize! I'm sure that this one will be handy in some other cases. Have a great day! TA
Re: Ignoring patterns when comparing strings
by vr (Curate) on Jun 29, 2018 at 00:06 UTC

    Another concern, to add to what Corion said, is that you unnecessarily interpolate all your data into double-quoted strings, as many times as you (previously) did substitution, i.e. on each comparison, and that leads to making temporary copies of strings.

    However, what I noticed, is that you are going to continue to

    compare the first one with the second one, then the third, etc. until the end, then repeat starting with the second string and compare it to the third, and so on.

    -- I don't know the size of your data and strings length, but judging by "4 hours" -- it's inefficient. Consider:

    use strict; use warnings; use feature 'say'; use Data::Dump; use Time::HiRes 'time'; use Digest::xxHash; use Algorithm::Combinatorics 'combinations'; srand( 1122 ); my $len = 1000; my $vol = 2000; my @data = map { join '', map { chr rand 256 } 1 .. $len; } 1 .. $vol; # add some duplication @data = ( @data, ( grep { rand > .8 } @data ), ( grep { rand > .4 } @data ), ( grep { rand > .2 } @data ), ); my %h1; my %h2; sub do_something {}; ########################### my $t = time; for my $i ( 0 .. $#data ) { for my $j ( $i + 1 .. $#data ) { do_something( $i, $j ) if $data[ $i ] eq $data[ $j ] } } say time - $t; ########################### $t = time; push @{ $h2{ $data[ $_ ]}}, $_ for 0 .. $#data; for ( grep { @$_ > 1 } values %h2 ) { do_something( @$_ ) for combinations( $_, 2 ) } say time - $t; __END__ 1.1641149520874 0.022386074066162

    I.e. temp hash of arrays of indices of equal strings is ~50 times faster than your approach, with data size chosen above.

    (BTW, I experimented with Digest::xxHash (not actually used in code above). It's claimed to provide extremely fast yet high quality hashing. With my "data" and hardware, hashing first with this module's functions and using digests as keys for Perl, begins to outperform Perl's built-in hashing, if $len is above 5000, with further gains up to 300%. But that's only a side-note.)

Re: Ignoring patterns when comparing strings
by kcott (Archbishop) on Jun 29, 2018 at 12:00 UTC

    G'day TravelAddict,

    "Any way I could improve the performance?"

    There's a number of things you could do (I acknowledge that some of these have already been mentioned in one form or another by others).

    With the regex:

    • My general rule-of-thumb is to not give the regex engine any more work than is absolutely necessary.
    • I don't know what '\/?' is for. Unless there's something about your data which you haven't shown, or maybe I missed it, I'd remove it.
    • If you have some insider knowledge about the frequency of the tags, use that order for the alternation; otherwise, I'd suggest using the shorter tag names before the long ones (less characters to check first).
    • You've used a lot of unnecessary modifiers: you haven't used '.' in the regex, so the 's' modifier will have no effect; you haven't used either of the '^' or '$' assertions, so the 'm' modifier will have no effect; all characters are lowercase, so the 'i' modifier will have no effect.

    With the return value:

    • Whatever the last statement in a subroutine evaluates to, will be returned by default (no explicit 'return' is needed).
    • '("$cString1" eq "$cString2")' returns a boolean. No need for the ternary operator or the two 'return' statements.
    • '("$cString1" eq "$cString2")' can be reduced to '$cString1 eq $cString2' (removing two stringification operations).

    I played around with various ways to speed up your routine. If you have Perl 5.14 or higher, here's a drop-in replacement:

    sub compare { state $stripped; ($stripped->{$_[0]} //= $_[0] =~ s/<(?:i|ph|bpt|ept) [^>]*?>//gr) eq ($stripped->{$_[1]} //= $_[1] =~ s/<(?:i|ph|bpt|ept) [^>]*?>//gr); }

    [If you have an earlier version of Perl, and don't know how to wind back this code, just ask.]

    I ran some benchmarks. I've no idea what your real data looks like or how big it is (neither record length nor number of records): the results I got should be viewed tenatively; you should run your own benchmarks with representative data. This code ran about 4 times faster than your posted 'compare()' (I took the fact that $stripped cached results across runs and skewed the results into consideration — actual results I got indicated more than 5 times faster).

    Here's the benchmark code (pm_1217567_cmp_bench.pl):

    !/usr/bin/env perl use 5.014; use warnings; use Benchmark 'cmpthese'; my @data = ( 'XX', 'XY', 'X<ph />X', 'X<ph />Y', 'X<bpt x="y">X', 'X<bpt x="y">Y', 'X<ept x="z">X', 'X<ept x="z">Y', 'X<i />X', 'X<i />Y', 'X<bpt x="y">XX<i />XX<i />X', 'X<bpt x="y">XX<i />XX<i />Y', ); if (@ARGV && $ARGV[0] eq 'bench') { cmpthese 0 => { orig => sub { run_comp(\&compare, 1) }, kens => sub { run_comp(\&kens_cmp, 1) }, }; } else { say '*** Original ***'; say '-' x 60; sub { run_comp(\&compare) }->(); say '*** Ken\'s ***'; say '-' x 60; sub { run_comp(\&kens_cmp) }->(); } sub run_comp { my ($cmp_func, $bench) = @_; for my $d1 (@data) { for my $d2 (@data) { if (defined $bench) { $cmp_func->($d1, $d2); } else { say "'$d1' and '$d2': ", $cmp_func->($d1, $d2) ? "SAME" : "DIFFER"; } } } } sub compare { # We ignore the standalone, opening and closing tags <ph />, <bpt x= +"y">, <ept x="z">, <i /> my $cString1 = $_[0]; my $cString2 = $_[1]; $cString1 =~ s/\/?<(bpt|ept|ph|i)[^<>]*>//gsmi; $cString2 =~ s/\/?<(bpt|ept|ph|i)[^<>]*>//gsmi; ("$cString1" eq "$cString2") ? return 1 : return 0; } sub kens_cmp { state $stripped; ($stripped->{$_[0]} //= $_[0] =~ s/<(?:i|ph|bpt|ept) [^>]*?>//gr) eq ($stripped->{$_[1]} //= $_[1] =~ s/<(?:i|ph|bpt|ept) [^>]*?>//gr); }

    I've set that up so you can run the benchmark like this:

    $ ./pm_1217567_cmp_bench.pl bench Rate orig kens orig 2035/s -- -81% kens 10576/s 420% --

    Or you can run it without benchmarking, to capture and check the results. There's a lot of output, you'll probably want to pipe it into your favourite pager or redirect to a file. I did this:

    $ ./pm_1217567_cmp_bench.pl > ./pm_1217567_cmp_bench_both.out $ cp ./pm_1217567_cmp_bench_both.out ./pm_1217567_cmp_bench_orig.out $ cp ./pm_1217567_cmp_bench_both.out ./pm_1217567_cmp_bench_kens.out $ vi ./pm_1217567_cmp_bench_orig.out # /Ken<NL>dGZZ $ vi ./pm_1217567_cmp_bench_kens.out # /Ken<NL>kmm1Gd'mZZ $ diff ./pm_1217567_cmp_bench_orig.out ./pm_1217567_cmp_bench_kens.out 1c1 < *** Original *** --- > *** Ken's *** $

    I don't know what your level of Perl knowledge is; feel free to ask if there's anything you don't follow.

    — Ken