in reply to Ignoring patterns when comparing strings

Another concern, to add to what Corion said: you unnecessarily interpolate all your data into double-quoted strings, as many times as you (previously) did the substitution, i.e. on each comparison, and that creates a temporary copy of each string every time.
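
For illustration, a minimal sketch of the difference (I don't have your exact loop, so $data, $i, $j and do_something are hypothetical placeholders):

    # interpolating into double quotes builds a fresh temporary string
    # on every single comparison:
    do_something( $i, $j ) if "$data[$i]" eq "$data[$j]";

    # comparing the scalars directly avoids those temporary copies:
    do_something( $i, $j ) if $data[$i] eq $data[$j];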

However, what I noticed is that you are going to continue to

compare the first one with the second one, then the third, etc. until the end, then repeat starting with the second string and compare it to the third, and so on.

-- I don't know the size of your data or the length of your strings, but, judging by "4 hours", it's inefficient. Consider:

    use strict;
    use warnings;
    use feature 'say';
    use Data::Dump;
    use Time::HiRes 'time';
    use Digest::xxHash;        # not actually used below (see side note)
    use Algorithm::Combinatorics 'combinations';

    srand( 1122 );

    my $len = 1000;
    my $vol = 2000;

    # random strings of length $len
    my @data = map {
        join '', map { chr rand 256 } 1 .. $len
    } 1 .. $vol;

    # add some duplication
    @data = ( @data,
        ( grep { rand > .8 } @data ),
        ( grep { rand > .4 } @data ),
        ( grep { rand > .2 } @data ),
    );

    my %h1;
    my %h2;

    sub do_something {};

    ########################### pairwise comparison (your approach)

    my $t = time;

    for my $i ( 0 .. $#data ) {
        for my $j ( $i + 1 .. $#data ) {
            do_something( $i, $j ) if $data[ $i ] eq $data[ $j ]
        }
    }

    say time - $t;

    ########################### hash of arrays of indices

    $t = time;

    push @{ $h2{ $data[ $_ ]}}, $_ for 0 .. $#data;

    for ( grep { @$_ > 1 } values %h2 ) {
        do_something( @$_ ) for combinations( $_, 2 )
    }

    say time - $t;

    __END__

    1.1641149520874
    0.022386074066162

I.e. a temporary hash of arrays of indices of equal strings is ~50 times faster than your approach, with the data size chosen above.
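
To make the trick itself clearer, here is the same grouping on toy data (the strings and indices are made up for illustration; group order from values %idx is unspecified):

    use strict;
    use warnings;
    use feature 'say';
    use Algorithm::Combinatorics 'combinations';

    # toy data: 'foo' occurs at indices 0, 2, 5; 'bar' at 1, 4
    my @data = qw( foo bar foo baz bar foo );

    my %idx;
    push @{ $idx{ $data[ $_ ] } }, $_ for 0 .. $#data;

    # only groups with 2+ members contain duplicates
    for my $group ( grep { @$_ > 1 } values %idx ) {
        say "@$_" for combinations( $group, 2 );
    }

    # prints each pair of indices holding equal strings, e.g.:
    # 0 2
    # 0 5
    # 2 5
    # 1 4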

(BTW, I experimented with Digest::xxHash (not actually used in the code above). It's claimed to provide extremely fast yet high-quality hashing. With my "data" and hardware, hashing first with this module's functions and using the digests as Perl hash keys begins to outperform Perl's built-in hashing once $len is above 5000, with further gains of up to 300%. But that's only a side note.)
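
In case it's useful, a sketch of what that variant could look like, reusing @data, do_something and combinations from the script above (the seed 0 is an arbitrary choice; grouping by digest means an astronomically unlikely collision could group unequal strings, hence the final eq check):

    use Digest::xxHash 'xxhash64';

    # group by 64-bit digest instead of by the full string, so Perl
    # hashes 8-byte keys rather than $len-byte keys
    my %by_digest;
    push @{ $by_digest{ xxhash64( $data[ $_ ], 0 ) } }, $_ for 0 .. $#data;

    for my $group ( grep { @$_ > 1 } values %by_digest ) {
        for ( combinations( $group, 2 ) ) {
            my ( $i, $j ) = @$_;
            # confirm with eq: digests could, in theory, collide
            do_something( $i, $j ) if $data[ $i ] eq $data[ $j ];
        }
    }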