Re^3: script optmization

Thanks for the positive feedback. I have some comments on your first three points.

Re "... fastest way to remove leading and trailing white space ...". I've also seen the documentation about anchors, although I can't remember where (I have an inkling it may have been in a book). Note that the regex I used was anchored at both ends (/^\s*(.*?)\s*$/). Whether two simple regexes beat one complex regex is going to depend on their relative complexity and on the string operated on. I wrote this benchmark:

#!/usr/bin/env perl -l

use strict;
use warnings;
use constant STRING => " \t aaa bbb ccc \t \n";

use Benchmark 'cmpthese';

print 'Sanity Tests:';
print 'shoura:    >', shoura_code(),   '<';
print 'kcott:     >', kcott_code(),    '<';
print 'marshall:  >', marshall_code(), '<';

cmpthese 0 => {
    S => \&shoura_code,
    K => \&kcott_code,
    M => \&marshall_code,
};

sub shoura_code {
    local $_ = STRING;

    chomp;
    s/^\s+|\s+$//g;

    return $_;
}

sub kcott_code {
    local $_ = STRING;

    ($_) = /^\s*(.*?)\s*$/;

    return $_;
}

sub marshall_code {
    local $_ = STRING;

    s/^\s+//;
    s/\s+$//;

    return $_;
}

I ran it five times, which is usual for me; here's the result that was closest to the average:

Sanity Tests:
shoura:    >aaa bbb ccc<
kcott:     >aaa bbb ccc<
marshall:  >aaa bbb ccc<
      Rate    S    M    K
S 292306/s   -- -32% -37%
M 432626/s  48%   --  -7%
K 464863/s  59%   7%   --

There was quite a lot of variance between runs, although 'K' was always faster than 'M'. Over the five runs, 'K' beat 'M' by 9, 7, 2, 14 and 7 percent. Both 'K' and 'M' were always substantially faster than 'S'.

Re "... split your $re statement into two parts ...". I often use the '@{[...]}' construct when interpolating the results of some processing into a string. My main intent was to create the regex once, instead of the (presumably) millions of times in the inner loop of the OP's code. I also benchmarked this (see the spoiler): it looks like your total saving would be measured in nanoseconds.
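As a minimal sketch of that idea (not the code from the spoiler): the '@{[...]}' construct interpolates the result of an expression, and qr// compiles the pattern once so that a loop can reuse it. The @terms and @lines data and the variable names here are assumptions for illustration only; the OP's real data isn't known.

```perl
use strict;
use warnings;

# Hypothetical search terms and input lines (assumed for illustration).
my @terms = ('W X Y', 'X Y', 'X Y Z');
my @lines = ('W X Y Z', 'a X Y b', 'nothing here');

# '@{[ ... ]}' interpolates the value of the enclosed expression.
# Build and compile the alternation once, outside the loop.
my $re = qr/(@{[ join '|', map { quotemeta } @terms ]})/;

# Reuse the precompiled regex for every line.
my $count = 0;
for my $line (@lines) {
    $count++ if $line =~ $re;
}

print "Lines matched: $count\n";
```

The point is only that the join, map and regex compilation happen once, however many times the loop runs.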

Re "I see no need at all to sort the search terms, ... The OP's requirement "for a sorted order" makes no sense to me at all.". I can understand that from the minimal test data supplied by the OP; however, the reason is probably to handle sequences with common sections. Consider the test data I used in the second benchmark:

my %seq = (
    'W X Y' => 'WbbbXbbbY',
    'X Y'   => 'XbbbY',
    'X Y Z' => 'XbbbYbbbZ',
);

If the target string was "W X Y Z", the result could be one of these three:

W XbbbY Z
WbbbXbbbY Z
W XbbbYbbbZ
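Here's a rough sketch of how those three outcomes can arise. It assumes the replacements are applied one key at a time, in whatever order the keys happen to be processed; that is my guess at the general approach, not the OP's actual code. The replace_in_order() helper and the @orders list are hypothetical names introduced for this example.

```perl
use strict;
use warnings;

my %seq = (
    'W X Y' => 'WbbbXbbbY',
    'X Y'   => 'XbbbY',
    'X Y Z' => 'XbbbYbbbZ',
);

# Hypothetical helper: apply each replacement in turn, in the key order given.
sub replace_in_order {
    my ($target, @keys) = @_;

    for my $key (@keys) {
        $target =~ s/\Q$key\E/$seq{$key}/g;
    }

    return $target;
}

# Three processing orders; the first key in each order is the one that "wins".
my @orders = (
    [ 'X Y',   'W X Y', 'X Y Z' ],
    [ 'W X Y', 'X Y',   'X Y Z' ],
    [ 'X Y Z', 'W X Y', 'X Y'   ],
);

my @results = map { replace_in_order('W X Y Z', @$_) } @orders;

print "$_\n" for @results;
```

Once one sequence has been replaced, the overlapping sequences no longer match, so the processing order alone decides the result.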

Sorting by length (longest first) would reduce that to two results, because 'W X Y' and 'X Y Z' are the same length. There may well be a requirement to also sort lexically to break that tie. Perhaps like this:

sort { length $b <=> length $a || $a cmp $b }
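A quick check of what that comparison does with the three keys from my test data (the @keys and @sorted names are just for this sketch):

```perl
use strict;
use warnings;

# Keys from the %seq test data, deliberately listed out of order.
my @keys = ('X Y', 'X Y Z', 'W X Y');

# Longest first; keys of equal length fall back to lexical order.
my @sorted = sort { length $b <=> length $a || $a cmp $b } @keys;

print join(', ', map { "'$_'" } @sorted), "\n";
```

That puts 'W X Y' ahead of 'X Y Z', with 'X Y' last, so the processing order is fully determined.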

But the OP has not given sufficient information. In fact, as I write this, it's been almost two days since the original posting and all requests for additional information have been ignored.

— Ken
