A minor flamewar erupted on the CB this morning (well, my time anyway) when a user asked for help understanding the following code:

($a =~ /(\d+)/)[0] <=> ($b =~ /(\d+)/)[0]
Apparently someone told this user[1] that this would solve the problem, and apparently it did; but, being a good student, he was trying to really understand it. Specifically, the [0] parts. My suggestion was to learn the Schwartzian Transform, which would let us simplify this. And, as an added bonus, I suggested it'd probably be faster, in that there'd be far fewer regex extractions.

Another experienced user took umbrage at this, warning that I shouldn't underestimate the overhead of creating anonymous arrays, nor the smallness of ln N when comparing the N regexes of the ST to the roughly 2N ln N regexes of the original compare routine. Then this user pointed out the pathological case of pre-sorted lists, where a 100-element array needs only 99 comparisons - though that's still 198 regexes vs the ST's 100. So, I had to look into this more.
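(For anyone else puzzled by the [0] parts: a match in list context returns its list of captures, and the (...)[0] list slice takes the first of them. A minimal illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# In list context a successful match returns its captured groups,
# so the (...)[0] list slice picks out the first capture: the digits.
my $num = ("item42" =~ /(\d+)/)[0];   # $num gets "42"

# The comparator above runs that extraction twice per comparison:
my @sorted = sort { ($a =~ /(\d+)/)[0] <=> ($b =~ /(\d+)/)[0] } qw(a10 a2 a1);
# @sorted is now "a1", "a2", "a10"
```

Each call to the comparator re-runs both regexes, which is exactly the cost the ST tries to pay only once per element.)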

Please note that this came up for two reasons: a) the original user was confused, and I thought that splitting the expression up would make it easier to understand (and the ST is a useful idiom to learn in general); and b) I thought regexes were relatively expensive to run in a tight loop like this (I wouldn't have bothered if it were unpack or substr).

So, first the code:

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark qw(cmpthese);
use List::Util qw(shuffle sum);

sub compare;

my @l;
for (10, 100, 1000) {
    @l = map { "a$_" } 1 .. $_;
    compare "Pre-sorted";
    @l = reverse @l;
    compare "Reverse-sorted";
    @l = shuffle @l;
    compare "Shuffled";
}

my %regexes;

sub brute_force {
    my $c = 0;
    my @s = sort {
        $c += 2;
        ($a =~ /(\d+)/)[0] <=> ($b =~ /(\d+)/)[0]
    } @l;
    push @{$regexes{force}}, $c;
}

sub ST {
    my $c = 0;
    my @s = map  { $_->[1] }
            sort { $a->[0] <=> $b->[0] }
            map  { $c += 1; /(\d+)/; [ $1, $_ ] } @l;
    push @{$regexes{ST}}, $c;
}

sub compare {
    my $title = shift;
    my $len   = @l;
    $title = "$title ($len).";
    print $title, "\n";
    print '-' x length $title, "\n";

    %regexes = ();
    cmpthese(-1, { force => \&brute_force, ST => \&ST });

    for (qw(force ST)) {
        my $avg = sum(@{$regexes{$_}}) / @{$regexes{$_}};
        printf "%10s %d regexes\n", $_, $avg;
    }
    print "\n";
}
And the results.
Pre-sorted (10).
----------------
         Rate force    ST
force 34461/s    --  -10%
ST    38399/s   11%    --
     force 38 regexes
        ST 10 regexes

Reverse-sorted (10).
--------------------
         Rate    ST force
ST    39097/s    --   -5%
force 41353/s    6%    --
     force 30 regexes
        ST 10 regexes

Shuffled (10).
--------------
         Rate force    ST
force 26795/s    --  -25%
ST    35544/s   33%    --
     force 50 regexes
        ST 10 regexes

Pre-sorted (100).
-----------------
        Rate    ST force
ST    4696/s    --  -30%
force 6698/s   43%    --
     force 198 regexes
        ST 100 regexes

Reverse-sorted (100).
---------------------
        Rate    ST force
ST    4745/s    --  -30%
force 6761/s   43%    --
     force 198 regexes
        ST 100 regexes

Shuffled (100).
---------------
        Rate force    ST
force 1347/s    --  -59%
ST    3257/s  142%    --
     force 1096 regexes
        ST 100 regexes

Pre-sorted (1000).
------------------
       Rate    ST force
ST    439/s    --  -35%
force 673/s   53%    --
     force 1998 regexes
        ST 1000 regexes

Reverse-sorted (1000).
----------------------
       Rate    ST force
ST    427/s    --  -37%
force 673/s   58%    --
     force 1998 regexes
        ST 1000 regexes

Shuffled (1000).
----------------
        Rate force    ST
force 86.3/s    --  -64%
ST     243/s  181%    --
     force 17440 regexes
        ST 1000 regexes
Of course, the number of regexes for ST is obvious - it's the brute-force method where it's not always obvious. (Perl's mergesort does roughly N log2 N comparisons on a shuffled list, at two regexes each, which for N = 1000 predicts on the order of 20,000 regexes - in line with the 17,440 observed.)
The conclusion I've reached is: don't overestimate the overhead of creating a list of anonymous arrays, the hugeness of ln N, or the unimportance of the pathological case - nor underestimate the overhead of the regex engine in a tight loop. Even for small N, a shuffled list provided enough reason to use the ST if your purpose is speed and you're using regexes to pull out the comparison information. This conclusion may scale to other scenarios, but without benchmarking them, I won't speak to them ;-) And all this is in addition to my continued claim that the ST idiom is useful in creating easier-to-maintain code (with the presumption that the maintainer knows the ST idiom, too).

[1] For privacy reasons, I'm not naming any user involved in the discussion other than myself. If they want to out themselves, that's their decision, not mine, to make.

Re: Underestimage (ST vs brute force)
by ikegami (Patriarch) on Dec 09, 2008 at 17:57 UTC
    I wasn't the one who suggested the non-ST solution, but I did bring up the possibility that ST *might* be slower for smaller sets in this case. Thanks for showing it's not. This contradicts numbers I thought I had seen in the past.

    By the way, I think you'll get even more speed if you avoid curlies around map.

my @s = map $_->[0],
        sort { $a->[1] <=> $b->[1] }
        map [ $_, (/(\d+)/)[0] ],
        @l;

      I'm sure we could get more speed that way, but that does defeat the primary purpose behind my suggestion for the ST: maintenance. You put back in the one thing I was trying to eliminate for the poor original petitioner: (/(\d+)/)[0]. By using a block, focused on a single item at a time, I could split that into multiple statements to make it easier to read.

      The first curlies could be removed with little ill effect on readability, but the curlies on the second map would be a negative to readability. And, so, I suggest, a premature optimisation :-P

Re: Underestimage (ST vs brute force)
by eric256 (Parson) on Dec 09, 2008 at 20:38 UTC

    Providing the following just because I'm slow and it might help someone else as well. I had to reformat your code in order to "see" it ;) and added some comments just because.

my @s = map  { $_->[1] }             # get back the original value ($_ from the third step in the map)
        sort { $a->[0] <=> $b->[0] } # sort on $1 from below
        map  {
            $c += 1;                 # count the regexes
            /(\d+)/;                 # put the digits from $_ into $1
            [ $1, $_ ]               # store $1 and $_ for later retrieval
        } @l;

    I liked the benchmarking code too, very nice setup. ;)


    ___________
    Eric Hodges
Re: Underestimage (ST vs brute force)
by Porculus (Hermit) on Dec 10, 2008 at 23:31 UTC

    For interest, I tried expanding the test to include the CPAN module Sort::Key, which basically does the same thing as an ST, only faster and without the confusing syntax. The pertinent part of the new code reads

use Sort::Key 'ikeysort';

sub sort_key {
    my $c = 0;
    my @s = ikeysort { $c += 1; /(\d+)/; $1 } @l;
    push @{$regexes{key}}, $c;
}

    which is then passed to the tests in exactly the same way as the existing brute_force() and ST().

    It works out about twice as fast as the ST on the shuffled lists, but is only about 10% faster than the ST on sorted lists, where brute force still wins.

Re: Underestimage (ST vs brute force)
by brian_d_foy (Abbot) on Dec 11, 2008 at 20:06 UTC

    I would expect the Schwartzian Transform to be slower when the number of items is small, the computation of the secondary key is very easy, and maybe in some pathological cases. That's how we teach it in our "Intermediate Perl" class; we make the students benchmark it. It's the subject of Wasting time thinking about wasted time.

    Uri Guttman and Larry Rosler spent a lot of time thinking about sorting and wrote a nice paper on the subject that includes about every method they could find.
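The central trick of that paper, often called the Guttman-Rosler Transform (GRT) or packed-default sort, dispenses with the comparator entirely: pack a fixed-width key onto the front of each string, use the default lexicographic sort, then strip the key off. A minimal sketch, assuming the numeric keys fit in ten digits:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @l = qw(a10 a2 a1);

# Pack a zero-padded, fixed-width numeric key in front of each value,
# sort the packed strings with the default (string) sort -- no
# comparator block at all -- then strip the key back off.
my @sorted = map { substr $_, 10 }                           # drop the 10-digit key
             sort                                            # default lexicographic sort
             map { sprintf("%010d", /(\d+)/ ? $1 : 0) . $_ }
             @l;
# @sorted is now "a1", "a2", "a10"
```

Because the default sort needs no Perl-level callback, this tends to beat both the ST and the brute-force comparator, at the cost of some setup fiddliness with key widths.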

    --
    brian d foy <brian@stonehenge.com>
    Subscribe to The Perl Review