
I couldn't leave well enough alone on this...

The regexp solution benefits from the very efficient regexp engine, but it rests on an algorithm whose running time grows polynomially with the input. If we expand the problem to finding uniqueness in strings made up of three-character-wide groups of alphabetical characters, that gives us a lot of room for dataset growth while still keeping every group in the string unique. The hash solution grows at O(n), since each hash insert takes O(1) on average. I can't quite pin down how badly the regular expression approach degrades as the string grows, but since the engine has to backtrack through the later groups for each candidate group, it's probably something like O(n^2) or worse.
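To make the O(n) claim concrete, here is a minimal sketch of the hash idea as a standalone check. The helper name `all_groups_unique` is hypothetical (it is not part of the benchmark in this post), and the `(a3)*` style unpack group assumes Perl 5.8 or later. Each group is looked up and inserted once, so the work grows linearly with the number of groups.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the original benchmark: returns 1 when every
# $width-character group in $string is distinct, 0 otherwise.
sub all_groups_unique {
    my ( $string, $width ) = @_;
    my %seen;
    # "(a$width)*" splits the string into fixed-width fields (Perl 5.8+).
    for my $group ( unpack "(a$width)*", $string ) {
        return 0 if $seen{$group}++;    # second sighting means a repeat
    }
    return 1;
}

print all_groups_unique( 'abcdefghi', 3 ) ? "unique\n" : "repeat\n";
print all_groups_unique( 'abcdefabc', 3 ) ? "unique\n" : "repeat\n";
```

Unlike the hash-slice version used in the benchmark, this variant stops at the first repeat, which is a design choice of the sketch rather than something the timings here measured.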

For short test strings the raw speed of the regexp engine wins over the complexity of the hashing algorithm. But for longer strings, there's literally no comparison. Here's some test code:

use strict;
use warnings;

use Benchmark qw( cmpthese ) ;

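# Package globals shared by both benchmarked subs.
our ( $tuplets, $template );

# 'aaa' .. 'caa' yields 1353 distinct three-letter groups.
$tuplets  = join '', ( 'aaa' .. 'caa' );
# unpack template: one 'a3' field per group.
$template = 'a3' x ( length( $tuplets ) / 3 );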

print "Test string contains ", 
      length( $tuplets ) / 3, 
      " groups.\n\n";

cmpthese( 
    -10, 
    {
        regexp => sub {
            return $tuplets !~ /^(?:.{3})*(.{3})(?:.{3})*\1/;
        },
        hash => sub {
            my %hash;
            @hash{ unpack $template, $tuplets } = ();
            return( 
                length( $tuplets ) / 3 == keys( %hash )
            );
        }
    }
);

And the results on my slow Pentium-II laptop:

Test string contains 1353 groups.

          s/iter regexp   hash
regexp      1.15     --   -98%
hash   1.84e-002  6123%     --

At first I thought my eyes were deceiving me. 1.84e-002 iterations per second? That's horrible. But then I realized that the regexp solution was so slow that Benchmark had switched to showing seconds per iteration. So the regexp approach takes 1.15 seconds per iteration in my test, while the hash approach takes a blink of an eye (1.84e-002 seconds, or about 18 milliseconds) for a test string of 1353 groups. Try testing 'aaa' .. 'faa'. You'll have to increase the testing time to about a minute to get reliable results out of Benchmark at that point, because the regexp approach becomes so sluggish.

Of course this is a contrived example, but aren't they all? ;) And I did have to modify the RE a little so that it would maintain proper framing. But the discussion caught my attention and I just had to prove to myself what I already suspected.
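A small sketch of why the framing matters. The unframed pattern `/(.{3}).*\1/` below is only an assumption about what a naive three-character version might look like; it is not quoted from the thread. The test string is made of three distinct groups, but the same three characters also occur straddling a group boundary, which the unframed pattern mistakes for a repeated group.

```perl
use strict;
use warnings;

# Groups are 'aab', 'caa', 'bcx' (all distinct), but the substring 'aab'
# also appears at offset 4, straddling two groups.
my $string = 'aabcaabcx';

# Assumed naive pattern: captures and re-matches at any offset.
my $unframed = $string =~ /(.{3}).*\1/ ? 1 : 0;

# Pattern from the benchmark: capture and backreference both sit on
# three-character boundaries.
my $framed = $string =~ /^(?:.{3})*(.{3})(?:.{3})*\1/ ? 1 : 0;

print "unframed sees a repeat: $unframed\n";
print "framed sees a repeat:   $framed\n";
```

The unframed pattern reports a repeat here while the framed one does not, which is the behaviour the modified RE is meant to guarantee.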

Dave

In reply to Re: Determining uniqueness in a string. by davido
in thread Determining uniqueness in a string. by Yzzyx
