Another concern, to add to what Corion said: you unnecessarily interpolate all your data into double-quoted strings, and you do so as many times as you (previously) did the substitution, i.e. on each comparison. That creates temporary copies of the strings every time.
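A minimal sketch of the interpolation point, using made-up sample data (the array contents and variable names are assumptions, not taken from your script). Both comparisons give the same answer; the quoted form just builds throwaway copies first:

```perl
use strict;
use warnings;

my @data = ( 'foo', 'bar', 'foo' );   # made-up sample data

# Interpolating builds a temporary copy of each string before comparing.
my $quoted = ( "$data[0]" eq "$data[2]" ) ? 1 : 0;

# Comparing the scalars directly gives the same answer without the copies.
my $plain  = ( $data[0] eq $data[2] ) ? 1 : 0;

print "quoted=$quoted plain=$plain\n";
```

With short strings the difference is negligible; it adds up when the strings are long and the comparison sits inside a nested loop.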

However, what I noticed is that you are going to continue to

"compare the first one with the second one, then the third, etc. until the end, then repeat starting with the second string and compare it to the third, and so on."

I don't know the size of your data or the length of your strings, but judging by "4 hours", that is inefficient. Consider:

    use strict;
    use warnings;
    use feature 'say';
    use Data::Dump;
    use Time::HiRes 'time';
    use Digest::xxHash;
    use Algorithm::Combinatorics 'combinations';

    srand( 1122 );

    my $len = 1000;
    my $vol = 2000;

    # $vol random strings of $len bytes each
    my @data = map {
        join '', map { chr rand 256 } 1 .. $len;
    } 1 .. $vol;

    # add some duplication
    @data = (
        @data,
        ( grep { rand > .8 } @data ),
        ( grep { rand > .4 } @data ),
        ( grep { rand > .2 } @data ),
    );

    my %h1;
    my %h2;

    sub do_something {};

    ###########################
    # pairwise comparison of every string with every later string

    my $t = time;

    for my $i ( 0 .. $#data ) {
        for my $j ( $i + 1 .. $#data ) {
            do_something( $i, $j )
                if $data[ $i ] eq $data[ $j ]
        }
    }

    say time - $t;

    ###########################
    # group indices by string, then pair up indices within each group

    $t = time;

    push @{ $h2{ $data[ $_ ]}}, $_ for 0 .. $#data;

    for ( grep { @$_ > 1 } values %h2 ) {
        do_something( @$_ ) for combinations( $_, 2 )
    }

    say time - $t;

    __END__
    1.1641149520874
    0.022386074066162

I.e. a temporary hash of arrays, holding the indices of equal strings, is ~50 times faster than your pairwise approach, with the data size chosen above.
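To see what the hash-of-arrays step produces, here is a small sketch on made-up data, using only core Perl (plain nested loops stand in for Algorithm::Combinatorics; the sample strings and the names %seen and @pairs are my assumptions):

```perl
use strict;
use warnings;

my @data = qw( apple pear apple fig pear apple );   # made-up sample data

# One pass: map each distinct string to the list of indices where it occurs.
my %seen;
push @{ $seen{ $data[$_] } }, $_ for 0 .. $#data;

# Only groups with more than one index contain equal pairs.
my @pairs;
for my $idx ( grep { @$_ > 1 } values %seen ) {
    for my $i ( 0 .. $#$idx - 1 ) {
        for my $j ( $i + 1 .. $#$idx ) {
            push @pairs, [ $idx->[$i], $idx->[$j] ];
        }
    }
}

# Sort for stable output, since hash order is not defined.
@pairs = sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] } @pairs;
print join( ' ', map { "$_->[0]-$_->[1]" } @pairs ), "\n";   # 0-2 0-5 1-4 2-5
```

The strings themselves are hashed once each, and the quadratic part runs only inside groups of equal strings, which is presumably where most of the saving comes from.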

(BTW, I experimented with Digest::xxHash (not actually used in the code above). It's claimed to provide extremely fast yet high-quality hashing. With my "data" and hardware, hashing first with this module's functions and using the digests as Perl hash keys begins to outperform Perl's built-in hashing once $len is above 5000, with further gains up to 300%. But that's only a side note.)


In reply to Re: Ignoring patterns when comparing strings by vr
in thread Ignoring patterns when comparing strings by TravelAddict
