Re: How can I do a numeric sort on a substring? [Benchmark]

G'day misterperl,

Other than a number of solutions involving modules, there were several pure Perl solutions. I was interested in how these might compare. The results I got were unexpected; I added a lot of variations (currently up to 14) but the same trend continued.

I had thought that the Schwartzian and Guttman-Rosler transformations would have been the fastest, but that wasn't the case. In fact, those that performed extraction of the number, prior to comparison, within the sort, were the quickest. Given these extractions might have been done multiple times for the same data, I suspect Perl (v5.34.0) is performing some optimisation and perhaps caching results; I would be interested in results others receive using different Perl versions. In case there's concern about the number of tests, the entire benchmark should run in under a minute: a COUNT of zero indicates 3 seconds for each of the 14 tests, plus a few seconds for other overhead.

Here's a rough breakdown of the results:

Schwartzian Transform: slowest. Test names begin with ST.
Guttman-Rosler Transform: second slowest but roughly twice as fast as Schwartzian Transform. Test names begin with GRT.
syphilis' solution: about 40-50% faster than Guttman-Rosler Transform; removing the length calculation added a small boost. Test names begin with mcs.
Other solutions only using sort were the hands-down winners by far. Their speeds varied slightly between runs; the results were too close to call. Test names begin with s.

Here's a sample run. It might be worth using the "download" link to see this without wrapping.

Perl & OS:
    v5.34.0 on cygwin
Unordered data (for preamble tests):
    a-10 a-01 a-22 a-2 a-0 a-3 a-000 a-1 a-12345 a-1
Preamble tests:
    grt_pack_expr:         a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a-
+12345
    grt_pack_expr_q:       a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a-
+12345
    st_regex:              a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a-
+12345
    st_regex_anchored:     a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a-
+12345
    st_regex_anch_expr_ni: a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a-
+12345
    st_regex_anch_ni:      a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a-
+12345
    st_regex_expr_ni:      a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a-
+12345
    st_regex_no_index:     a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a-
+12345
    map_cat_substr:        a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a-
+12345
    map_cat_substr_len:    a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a-
+12345
    sort_pack:             a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a-
+12345
    sort_regex:            a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a-
+12345
    sort_regex_anchored:   a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a-
+12345
    sort_substr:           a-0 a-000 a-01 a-1 a-1 a-2 a-3 a-10 a-22 a-
+12345
Legend:
    GRTpe:  grt_pack_expr
    GRTpeq: grt_pack_expr_q
    STr:    st_regex
    STra:   st_regex_anchored
    STraen: st_regex_anch_expr_ni
    STran:  st_regex_anch_ni
    STren:  st_regex_expr_ni
    STrn:   st_regex_no_index
    mcs:    map_cat_substr
    mcsl:   map_cat_substr_len
    sp:     sort_pack
    sr:     sort_regex
    sra:    sort_regex_anchored
    ss:     sort_substr
Benchmarks:
    Note: Unordered data extended with 'map "a-$_", shuffle 0..10000'
           Rate  STraen   STran    STra    STrn   STren     STr   GRTp
+e  GRTpeq    mcsl     mcs    sp    sr    ss   sra
STraen   68.2/s      --     -0%     -1%     -2%     -3%     -5%    -54
+%    -54%    -60%    -61% -100% -100% -100% -100%
STran    68.6/s      0%      --     -0%     -1%     -3%     -4%    -54
+%    -54%    -60%    -61% -100% -100% -100% -100%
STra     68.6/s      1%      0%      --     -1%     -2%     -4%    -53
+%    -54%    -60%    -61% -100% -100% -100% -100%
STrn     69.3/s      2%      1%      1%      --     -1%     -3%    -53
+%    -54%    -59%    -60% -100% -100% -100% -100%
STren    70.4/s      3%      3%      3%      2%      --     -2%    -52
+%    -53%    -59%    -60% -100% -100% -100% -100%
STr      71.5/s      5%      4%      4%      3%      2%      --    -52
+%    -52%    -58%    -59% -100% -100% -100% -100%
GRTpe     147/s    116%    115%    115%    113%    109%    106%      -
+-     -1%    -13%    -16% -100% -100% -100% -100%
GRTpeq    149/s    119%    118%    118%    116%    112%    109%      1
+%      --    -12%    -15% -100% -100% -100% -100%
mcsl      170/s    149%    148%    148%    145%    141%    138%     15
+%     14%      --     -3% -100% -100% -100% -100%
mcs       175/s    157%    156%    155%    153%    149%    145%     19
+%     17%      3%      -- -100% -100% -100% -100%
sp     263168/s 385629% 383801% 383272% 379410% 373718% 368096% 178350
+% 175994% 154699% 149996%    --   -2%   -2%   -3%
sr     268680/s 393708% 391841% 391301% 387358% 381547% 375807% 182088
+% 179682% 157941% 153139%    2%    --   -0%   -1%
ss     269453/s 394842% 392970% 392428% 388474% 382646% 376890% 182612
+% 180200% 158396% 153581%    2%    0%    --   -0%
sra    270688/s 396651% 394771% 394227% 390254% 384400% 378617% 183450
+% 181026% 159122% 154285%    3%    1%    0%    --
[download]

There was a typo in the code; see update below. The original Benchmarks: section is in the spoiler below.

Here's the code:

#!/usr/bin/env perl

use strict;
use warnings;
use namespace::autoclean;

use Benchmark 'cmpthese';
use List::Util 'shuffle';

my @unordered = qw{a-10 a-01 a-22 a-2 a-0 a-3 a-000 a-1 a-12345 a-1};
my %expanded_abbrev_for = (
    sr => 'sort_regex',
    STr => 'st_regex',
    STrn => 'st_regex_no_index',
    STren => 'st_regex_expr_ni',
    sra => 'sort_regex_anchored',
    STra => 'st_regex_anchored',
    STran => 'st_regex_anch_ni',
    STraen => 'st_regex_anch_expr_ni',
    ss => 'sort_substr',
    mcs => 'map_cat_substr',
    mcsl => 'map_cat_substr_len',
    sp => 'sort_pack',
    GRTpe => 'grt_pack_expr',
    GRTpeq => 'grt_pack_expr_q',
);
my %coderef_for = (
    sr => \&sort_regex,
    STr => \&st_regex,
    STrn => \&st_regex_no_index,
    STren => \&st_regex_expr_ni,
    sra => \&sort_regex_anchored,
    STra => \&st_regex_anchored,
    STran => \&st_regex_anch_ni,
    STraen => \&st_regex_anch_expr_ni,
    ss => \&sort_substr,
    mcs => \&map_cat_substr,
    mcsl => \&map_cat_substr_len,
    sp => \&sort_pack,
    GRTpe => \&grt_pack_expr,
    GRTpeq => \&grt_pack_expr_q,
);

print "Perl & OS:\n    $^V on $^O\n";
print "Unordered data (for preamble tests):\n    @unordered\n";
print "Preamble tests:\n";
my $tests_fmt = "    %-22s %s\n";

for my $name (sort keys %coderef_for) {
    printf $tests_fmt, "$expanded_abbrev_for{$name}:",
            "@{[$coderef_for{$name}->()]}";
}

exit if @ARGV && $ARGV[0] eq '--dry_run';

print "Legend:\n";
my $legend_fmt = "    %-7s %s\n";

for my $abbrev (sort keys %expanded_abbrev_for) {
    printf $legend_fmt, "$abbrev:",
            $expanded_abbrev_for{$abbrev};
}

# Extend @unordered for improved benchmarking
push @unordered, map "a-$_", shuffle 0..10000;

print "Benchmarks:\n";
print "    Note: Unordered data extended with 'map \"a-\$_\", shuffle 
+0..10000'\n";
my $count = 0;
cmpthese $count => \%coderef_for;

sub sort_regex {
    sort {
        ($a =~ /(\d+)/)[0] <=> ($b =~ /(\d+)/)[0]
    } @unordered;
}

sub st_regex {
    map {
        $_->[0]
    }
    sort {
        $a->[1] <=> $b->[1]
    }
    map {
        [$_, (/(\d+)/)[0]]
    } @unordered;
}

sub st_regex_no_index {
    map {
        $_->[0]
    }
    sort {
        $a->[1] <=> $b->[1]
    }
    map {
        [$_, /(\d+)/]
    } @unordered;
}

sub st_regex_expr_ni {
    map $_->[0],
    sort {
        $a->[1] <=> $b->[1]
    }
    map [$_, /(\d+)/], @unordered;
}

sub sort_regex_anchored {
    sort {
        ($a =~ /(\d+)$/)[0] <=> ($b =~ /(\d+)$/)[0]
    } @unordered;
}

sub st_regex_anchored {
    map {
        $_->[0]
    }
    sort {
        $a->[1] <=> $b->[1]
    }
    map {
        [$_, (/(\d+)$/)[0]]
    } @unordered;
}

sub st_regex_anch_ni {
    map {
        $_->[0]
    }
    sort {
        $a->[1] <=> $b->[1]
    }
    map {
        [$_, /(\d+)$/]
    } @unordered;
}

sub st_regex_anch_expr_ni {
    map $_->[0],
    sort {
        $a->[1] <=> $b->[1]
    }
    map [$_, /(\d+)$/], @unordered;
}

sub sort_substr {
    sort {
        substr($a, 2) <=> substr($b, 2)
    } @unordered;
}

sub map_cat_substr {
    map {
        'a-' . $_
    }
    sort {
        $a <=> $b
    }
    map {
        substr $_, 2
    } @unordered;
}

sub map_cat_substr_len {
    map {
        'a-' . $_
    }
    sort {
        $a <=> $b
    }
    map {
        substr $_, 2, length($_) - 2
    } @unordered;
}

sub sort_pack {
    sort {
        pack(L => substr($a, 2)) cmp pack(L => substr($b, 2))
    } @unordered;
}

sub grt_pack_expr {
    map substr($_, 4),
    sort
    map pack(L => substr($_, 2)) . $_, @unordered;
}

sub grt_pack_expr_q {
    map substr($_, 8),
    sort
    map pack(Q => substr($_, 2)) . $_, @unordered;
}
[download]

Notes on pack:

I'm not any great expert with this function. Happy to receive corrections, advice, etc.
The doco indicates that the "Q" template is not universally available. If you don't have it, you might want to remove &grt_pack_expr_q and the two lines referencing it in %expanded_abbrev_for and %coderef_for (the key is "GRTpeq" in both cases). If you do have it, be aware others, with whom you share your code, might not.

Update: There was a typo in my original code: the first key/value pair in %coderef_for was sr => \&sort_substr but it should have been sr => \&sort_regex. I've fixed my code locally and edited the code above to reflect this fix.

I reran the code and have posted a new Benchmarks: section. All of the output before Benchmarks: is unchanged from the original posting. The old Benchmarks: section is now in a spoiler. The results are not particularly different — except sr and ss are no longer identical; although, still too close to call — and my original "rough breakdown of the results" has not changed.

— Ken

Comment on Re: How can I do a numeric sort on a substring? [Benchmark] Select or Download Code

Replies are listed 'Best First'.
Re^2: How can I do a numeric sort on a substring? [Benchmark] by LanX (Saint) on Jun 26, 2021 at 13:28 UTC
> I had thought that the Schwartzian and Guttman-Rosler transformations would have been the fastest, but that wasn't the case. In fact, those that performed extraction of the number, prior to comparison, within the sort, were the quickest you are only testing with 10 elements in your `@unordered` array. for `n` elements you have in best case `O(nlog(n))` comparisons but `O(n)` packs and unpacks with ST and GRT, which means the overhead to extract n numbers will account much more for small n. use at least n >> 1000 elements for a real benchmark. Rule of thumb* : the choice of algorithm is almost always neglectable for small data. update it's like getting the Porsche out of the garage to buy a six-pack of beer just 10m around the corner. Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery}	[reply] [d/l] [select]
Re^3: How can I do a numeric sort on a substring? [Benchmark] by kcott (Archbishop) on Jun 27, 2021 at 02:41 UTC
G'day Rolf, "you are only testing with 10 elements in your `@unordered` array." The code only contains one comment which I added for the express purpose of heading off such feedback: `# Extend @unordered for improved benchmarking push @unordered, map "a-$_", shuffle 0..10000;` [download] "use at least n >> 1000 elements for a real benchmark" Indeed. I used 10,011 elements "`for improved benchmarking`". The preamble tests were to check that all subroutines returned identical results. Each of the preamble tests were only run once and did not involve any benchmarking; they were intended as a sanity check (processing stopped here if `--dry_run` was used). The data used for this had 10 elements, which I considered sufficient for these tests, and was indicated by: `Unordered data (for preamble tests):` [download] — Ken	[reply] [d/l] [select]
Re^4: How can I do a numeric sort on a substring? [Benchmark] by LanX (Saint) on Jun 27, 2021 at 11:56 UTC
Hi Ken > I added for the express purpose of heading off such feedback Oops, sorry I missed that. :/ I can't dig deeper now, but some suggestions, to solve the miracle it's still logarithmic, maybe try 1e6 elements instead Perl is propagating the caller's context to the returning statement, maybe you should check if benchmark is using list context too? Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery}	[reply]
Re^5: How can I do a numeric sort on a substring? (context matters) by LanX (Saint) on Jun 27, 2021 at 13:46 UTC
Re^6: How can I do a numeric sort on a substring? (context matters) by kcott (Archbishop) on Jun 27, 2021 at 23:22 UTC
Some notes below your chosen depth have not been shown here
Re^6: How can I do a numeric sort on a substring? (context matters) by swl (Prior) on Jun 28, 2021 at 04:00 UTC
Some notes below your chosen depth have not been shown here
Re^2: How can I do a numeric sort on a substring? [Benchmark] by Anonymous Monk on Jun 26, 2021 at 08:18 UTC
my $HAS_Q = !!eval { pack 'Q', 1; 1 }; ...	[reply]