I can't help with the matching itself, but for a slight adjustment, you might try benchmarking one of the following changes to your for loops:

for ( my $i = $size; $i--; ) { ... }    # outer loop
for ( my $j = $i; $j--; ) { ... }       # inner loop

or

foreach my $i ( 0 .. $size-1 ) { ... }      # outer loop
foreach my $j ( $i+1 .. $size-1 ) { ... }   # inner loop

or

foreach my $i ( 0 .. $size-1 ) { ... }      # outer loop
foreach my $string2 ( @{$arrayDocs}[ $i+1 .. $size-1 ] ) {
    # change references to '$arrayDocs->[$j]' to '$string2'
}

The first one can save a couple of operations per cycle. That is most likely not significant compared to the contents of the loops, but it might shave off a second or two, and the trick works in other languages too (see below). The second one assumes that perl's optimization of iterating through a list of integers is faster than a C-style 'for' loop (see below). The last one tries to save time by reducing the number of times $arrayDocs->[$j] is dereferenced.

You'd have to test the last one for yourself, as it's going to be affected by the qualities of the data (how many times you actually match).
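One way to test that last variant is a small harness along these lines. This is only a sketch: the data is a made-up stand-in for the real $arrayDocs, the names pairs_by_index and pairs_by_slice are hypothetical, and the loop bodies just count pairs where the real comparison would go.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical stand-in data; the real $arrayDocs comes from your own script.
my $size      = 500;
my $arrayDocs = [ map { "document $_" } 1 .. $size ];

# Inner loop indexes into the array on every pass.
sub pairs_by_index {
    my $pairs = 0;
    foreach my $i ( 0 .. $size - 1 ) {
        my $string1 = $arrayDocs->[$i];
        foreach my $j ( $i + 1 .. $size - 1 ) {
            my $string2 = $arrayDocs->[$j];
            $pairs++;    # placeholder for the real comparison
        }
    }
    return $pairs;
}

# Inner loop walks an array slice, so $string2 is handed over directly.
sub pairs_by_slice {
    my $pairs = 0;
    foreach my $i ( 0 .. $size - 1 ) {
        my $string1 = $arrayDocs->[$i];
        foreach my $string2 ( @{$arrayDocs}[ $i + 1 .. $size - 1 ] ) {
            $pairs++;    # placeholder for the real comparison
        }
    }
    return $pairs;
}

# Run each variant for at least one CPU second and compare the rates.
cmpthese( -1, { index => \&pairs_by_index, slice => \&pairs_by_slice } );
```

Swap the placeholder for your actual matching code and your own data before trusting the numbers, since building the slice has a cost of its own.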

I know, people are going to complain that I'm optimizing the wrong part, but if it shaves off a few seconds at 5k records, it should take off ~900 times that amount at 150k records, because the number of pairs grows with the square of the record count: (150k / 5k)^2 = 30^2 = 900.
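If you want to check that scaling estimate, a quick sketch like this counts the pairs the nested loops visit at each size (pair_count is just a helper name made up for the illustration):

```perl
use strict;
use warnings;

# Number of unordered pairs the nested loops visit for $n records.
sub pair_count {
    my ($n) = @_;
    return $n * ( $n - 1 ) / 2;
}

my $small = pair_count(5_000);
my $large = pair_count(150_000);

# %.0f rather than %d so the large count prints correctly on perls with 32-bit integers
printf "%.0f pairs vs %.0f pairs: factor of about %.0f\n",
    $small, $large, $large / $small;
```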

#            s/iter   orig  backwards  foreach
# orig         10.6     --       -45%     -55%
# backwards    5.80    83%         --     -18%
# foreach      4.75   123%        22%       --

use Benchmark qw(cmpthese);

my $size = 5000;

# Original C-style loops, counting up
my $orig = sub {
    for (my $i = 0; $i < ($size - 1); $i++) {
        for (my $j = $i + 1; $j < ($size - 1); $j++) { }
    }
};

# C-style loops counting down; the test and the decrement are one expression
my $backwards = sub {
    for ( my $i = $size; $i--; ) {
        for ( my $j = $i; $j--; ) { }
    }
};

# foreach over integer ranges
my $foreach = sub {
    foreach my $i ( 0 .. $size-1 ) {
        foreach my $j ( $i+1 .. $size-1 ) { }
    }
};

cmpthese( 10, {
    orig      => $orig,
    backwards => $backwards,
    foreach   => $foreach,
} );

In reply to Re: Fast string similarity method by jhourcle
in thread Fast string similarity method by icanwin
