
Thanks for the positive feedback. I have some comments on your first three points.

Re "... fastest way to remove leading and trailing white space ...". I've also seen the documentation about anchors, although I can't remember where (I have an inkling it may have been in a book); the regex I used was anchored at both ends (/^\s*(.*?)\s*$/). As for two easy regexes vs. one complex regex, that's going to depend on their relative complexity and on the string operated on. I wrote this benchmark:

#!/usr/bin/env perl -l

use strict;
use warnings;
use constant STRING => " \t aaa bbb ccc \t \n";

use Benchmark 'cmpthese';

print 'Sanity Tests:';
print 'shoura:    >', shoura_code(),   '<';
print 'kcott:     >', kcott_code(),    '<';
print 'marshall:  >', marshall_code(), '<';

cmpthese 0 => {
    S => \&shoura_code,
    K => \&kcott_code,
    M => \&marshall_code,
};

sub shoura_code {
    local $_ = STRING;

    chomp;
    s/^\s+|\s+$//g;

    return $_;
}

sub kcott_code {
    local $_ = STRING;

    ($_) = /^\s*(.*?)\s*$/;

    return $_;
}

sub marshall_code {
    local $_ = STRING;

    s/^\s+//;
    s/\s+$//;

    return $_;
}

I ran it five times, as I usually do; here's the result that was closest to the average:

Sanity Tests:
shoura:    >aaa bbb ccc<
kcott:     >aaa bbb ccc<
marshall:  >aaa bbb ccc<
      Rate    S    M    K
S 292306/s   -- -32% -37%
M 432626/s  48%   --  -7%
K 464863/s  59%   7%   --

There was quite a lot of variance, although 'K' was always faster than 'M'. Across the five runs, the percentages by which 'K' beat 'M' were: 9, 7, 2, 14, 7. Both 'K' and 'M' were always substantially faster than 'S'.

Re "... split your $re statement into two parts ...". I often use the '@{[...]}' construct when interpolating the results of some processing into a string. My main intent was to create the regex once, instead of the (presumably) millions of times in the inner loop of the OP's code. I also benchmarked this (see the spoiler): it looks like your total saving would be measured in nanoseconds.
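As a minimal sketch of that idea (the terms and lines here are made up for illustration; the OP's real data isn't shown), the regex is built once with '@{[...]}' outside the loop and then simply reused inside it:

```perl
use strict;
use warnings;

# Hypothetical search terms and input lines, for illustration only.
my @terms = ('W X Y', 'X Y', 'X Y Z');
my @lines = ('a W X Y b', 'c X Y d', 'no match here');

# Build the alternation once. '@{[ ... ]}' interpolates the result of
# the expression; quotemeta protects any regex metacharacters in a term.
my $re = qr/(@{[ join '|', map { quotemeta $_ } @terms ]})/;

for my $line (@lines) {
    # The precompiled $re is reused on every iteration.
    if ($line =~ $re) {
        print "matched '$1' in '$line'\n";
    }
}
```

Whether this makes a measurable difference will depend on how many times the loop runs and how expensive the pattern is to build.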

Re "I see no need at all to sort the search terms, ... The OP's requirement "for a sorted order" makes no sense to me at all.". I can understand that from the minimal test data supplied by the OP; however, the reason is probably to handle sequences with common sections. Consider the test data I used in the second benchmark:

my %seq = (
    'W X Y' => 'WbbbXbbbY',
    'X Y'   => 'XbbbY',
    'X Y Z' => 'XbbbYbbbZ',
);

If the target string was "W X Y Z", the result could be any one of these three:

W XbbbY Z
WbbbXbbbY Z
W XbbbYbbbZ

Sorting by length (longest first) would reduce that to two results, because 'X Y' would then always be tried after the two longer sequences. There may well be a requirement to also sort lexically. Perhaps like this:

sort { length $b <=> length $a || $a cmp $b }
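To illustrate, here's a rough sketch using the %seq data from above. It assumes the replacements are applied one after another in key order; apply_in_order() is a hypothetical helper of mine, not something from the OP's code:

```perl
use strict;
use warnings;

my %seq = (
    'W X Y' => 'WbbbXbbbY',
    'X Y'   => 'XbbbY',
    'X Y Z' => 'XbbbYbbbZ',
);

# Hypothetical helper: apply each replacement in turn, in the order given.
sub apply_in_order {
    my ($text, @keys) = @_;
    $text =~ s/\Q$_\E/$seq{$_}/g for @keys;
    return $text;
}

# Different orderings give different results for the same input.
print apply_in_order('W X Y Z', 'X Y',   'W X Y', 'X Y Z'), "\n";   # W XbbbY Z
print apply_in_order('W X Y Z', 'X Y Z', 'W X Y', 'X Y'),   "\n";   # W XbbbYbbbZ

# Longest first, ties broken lexically: the order is fully determined.
my @keys = sort { length $b <=> length $a || $a cmp $b } keys %seq;
print apply_in_order('W X Y Z', @keys), "\n";                       # WbbbXbbbY Z
```

With the combined sort, the key order is always 'W X Y', 'X Y Z', 'X Y', so the same input gives the same output on every run, regardless of hash ordering.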

But the OP has not given sufficient information. In fact, as I write this, it's been almost two days since the original posting and all requests for additional information have been ignored.

— Ken

In reply to Re^3: script optmization by kcott
in thread script optmization by shoura
