comment on

A minor flamewar erupted on the CB this morning (well, my time anyway) when a user asked for help understanding the following code:

  ($a =~ /(\d+)/)[0] <=> ($b =~ /(\d+)/)[0]
[download]

Apparently, someone told this user¹ that this would solve the problem. Apparently, it did, but, as a good student, he was attempting to really understand it. Specifically, the [0] parts. My suggestion was to learn the Schwarzian Transform, and then we could simplify this. And, as an added bonus, I suggested that it'd probably be faster in that there'd be far fewer regex extractions. Another experienced user took umbrage at this, warning that I shouldn't underestimate the overhead of creating anonymous arrays, nor the smallness of ln N when comparing N regexes in the ST to 2NlnN regexes in the original compare routine. Then this user pointed out the pathological case of pre-sorted lists, where there were only 99 comparisons in a 100-element array - though that's still 198 regexes vs ST's 100. So, I had to look into this more.

Please note, that this came up for two reasons: a) the original user was confused, and I thought that splitting this up would make things easier to understand (and the ST is a useful idiom to learn in general), and b) I thought regex's were relatively expensive (I wouldn't have bothered if it were unpack or substr) for running in a tight loop like this.

So, first the code:

#!/usr/bin/perl
use strict;
use warnings;

sub compare;

my @l;
for (10, 100, 1000)
{
    @l = map { "a$_" } 1..$_;
    compare "Pre-sorted";
    
    @l = reverse @l;
    compare "Reverse-sorted";
    
    use List::Util qw(shuffle);
    @l = shuffle @l;
    compare "Shuffled";
}

my %regexes;
sub brute_force
{
    my $c = 0;
    my @s = sort {
        $c += 2;
        ($a =~ /(\d+)/)[0] <=> ($b =~ /(\d+)/)[0]
    } @l;
    push @{$regexes{force}}, $c;
}

sub ST
{
    my $c = 0;
    my @s = map { $_->[1] } sort { $a->[0] <=> $b->[0] } map {
        $c += 1;
        /(\d+)/;
        [ $1, $_ ]
    } @l;
    push @{$regexes{ST}}, $c;
}

sub compare
{
    use Benchmark qw(cmpthese);

    my $title = shift;
    my $len   = @l;
    $title = "$title ($len).";
    print $title, "\n";
    print '-' x length $title, "\n";
    %regexes = ();

    cmpthese(-1,
             {
                 force => \&brute_force,
                 ST    => \&ST,
             }
            );
    for (qw(force ST))
    {
        use List::Util qw(sum);
        my $avg = sum(@{$regexes{$_}}) / @{$regexes{$_}};
        printf "%10s %d regexes\n", $_, $avg;
    }
    print "\n";
}
[download]

And the results.

Pre-sorted (10).
----------------
         Rate force    ST
force 34461/s    --  -10%
ST    38399/s   11%    --
     force 38 regexes
        ST 10 regexes

Reverse-sorted (10).
--------------------
         Rate    ST force
ST    39097/s    --   -5%
force 41353/s    6%    --
     force 30 regexes
        ST 10 regexes

Shuffled (10).
--------------
         Rate force    ST
force 26795/s    --  -25%
ST    35544/s   33%    --
     force 50 regexes
        ST 10 regexes

Pre-sorted (100).
-----------------
        Rate    ST force
ST    4696/s    --  -30%
force 6698/s   43%    --
     force 198 regexes
        ST 100 regexes

Reverse-sorted (100).
---------------------
        Rate    ST force
ST    4745/s    --  -30%
force 6761/s   43%    --
     force 198 regexes
        ST 100 regexes

Shuffled (100).
---------------
        Rate force    ST
force 1347/s    --  -59%
ST    3257/s  142%    --
     force 1096 regexes
        ST 100 regexes

Pre-sorted (1000).
------------------
       Rate    ST force
ST    439/s    --  -35%
force 673/s   53%    --
     force 1998 regexes
        ST 1000 regexes

Reverse-sorted (1000).
----------------------
       Rate    ST force
ST    427/s    --  -37%
force 673/s   58%    --
     force 1998 regexes
        ST 1000 regexes

Shuffled (1000).
----------------
        Rate force    ST
force 86.3/s    --  -64%
ST     243/s  181%    --
     force 17440 regexes
        ST 1000 regexes
[download]

Of course, the number of regexes for ST is obvious - it's the brute-force method where it's not always obvious.

The conclusion I've reached is: don't overestimate the overhead of creating a list of anonymous arrays, the hugeness of lnN, the unimportance of the pathological case, nor the overhead of the regex engine when in a tight loop. Even for small N, a shuffled list provided enough reason to use ST if your purpose is speed and you're using regexes to pull out the comparison information. This conclusion may scale to other scenarios, but without benchmarking them, I won't speak to them ;-) And all this is in addition to my continued claim that the ST idiom is useful in creating easier-to-maintain code (with the presumption that the maintainer knows the ST idiom, too).

¹ For privacy reasons, I'm not naming any user involved in the discussion other than myself. If they want to out themselves, that's their decision, not mine, to make.

In reply to Underestimage (ST vs brute force) by Tanktalus

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.