G'day Marshall,

Thanks for the positive feedback. I have some comments on your first three points.

Re "... fastest way to remove leading and trailing white space ...". I've also seen the documentation about anchors, though I can't remember where (I have an inkling it may have been in a book). Note that the regex I used was anchored at both ends (/^\s*(.*?)\s*$/). Whether two easy regexes beat one complex regex is going to depend on their relative complexity and on the string operated on. I wrote this benchmark:

#!/usr/bin/env perl -l

use strict;
use warnings;

use constant STRING => " \t aaa bbb ccc \t \n";

use Benchmark 'cmpthese';

print 'Sanity Tests:';
print 'shoura: >', shoura_code(), '<';
print 'kcott: >', kcott_code(), '<';
print 'marshall: >', marshall_code(), '<';

cmpthese 0 => {
    S => \&shoura_code,
    K => \&kcott_code,
    M => \&marshall_code,
};

sub shoura_code {
    local $_ = STRING;
    chomp;
    s/^\s+|\s+$//g;
    return $_;
}

sub kcott_code {
    local $_ = STRING;
    ($_) = /^\s*(.*?)\s*$/;
    return $_;
}

sub marshall_code {
    local $_ = STRING;
    s/^\s+//;
    s/\s+$//;
    return $_;
}

I ran it five times — that's usual for me — here's the result that was closest to an average:

Sanity Tests:
shoura: >aaa bbb ccc<
kcott: >aaa bbb ccc<
marshall: >aaa bbb ccc<
      Rate    S    M    K
S 292306/s   -- -32% -37%
M 432626/s  48%   --  -7%
K 464863/s  59%   7%   --

There was quite a lot of variance, although 'K' was always faster than 'M': the five K-M percentages were 9, 7, 2, 14 and 7. Both 'K' and 'M' were always substantially faster than 'S'.

Re "... split your $re statement into two parts ...". I often use the '@{[...]}' construct when interpolating the results of some processing into a string. My main intent was to create the regex once, instead of the (presumably) millions of times in the inner loop of the OP's code. I also benchmarked this (see the spoiler): it looks like your total saving would be measured in nanoseconds.
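For anyone unfamiliar with the '@{[...]}' construct, here's a minimal sketch of the idea. The search terms and input lines are made up for illustration and are not the OP's data; the point is only that the one-step and two-step forms build the same pattern, and that either way it is compiled once, outside the loop.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical search terms, for illustration only.
my @terms = ('X Y', 'W X Y', 'X Y Z');

# One step: @{[ ... ]} evaluates the expression, wraps the result in an
# anonymous array, and interpolates that array into the pattern.
my $re = qr/@{[ join '|', map { quotemeta($_) } @terms ]}/;

# Two steps: build the alternation first, then compile it.
my $alternation = join '|', map { quotemeta($_) } @terms;
my $re_two_step = qr/$alternation/;

# Either way the regex is created once, before the loop, not on every pass.
my @lines = ('a W X Y b', 'no match here', 'X Y Z');
my $hits = grep { /$re/ } @lines;

print "$hits\n";
```

The choice between the two forms is mostly a matter of taste: the two-step version gives the intermediate string a name, which can help when debugging.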

Re "I see no need at all to sort the search terms, ... The OP's requirement "for a sorted order" makes no sense to me at all.". I can understand that from the minimal test data supplied by the OP; however, the reason is probably to handle sequences with common sections. Consider the test data I used in the second benchmark:

my %seq = (
    'W X Y' => 'WbbbXbbbY',
    'X Y'   => 'XbbbY',
    'X Y Z' => 'XbbbYbbbZ',
);

If the target string was "W X Y Z", the result could be any one of these three:

W XbbbY Z
WbbbXbbbY Z
W XbbbYbbbZ

Sorting by length would reduce that to two results. There may well be a requirement to also sort lexically. Perhaps like this:

sort { length $b <=> length $a || $a cmp $b }

But the OP has not given sufficient information. In fact, as I write this, it's been almost two days since the original posting and all requests for additional information have been ignored.
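To make the effect of the ordering concrete, here's a small sketch using the %seq data above. It assumes the replacements are applied one key at a time, in order, which is my guess at what the OP intends; apply_in_order() is a hypothetical helper written just for this illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my %seq = (
    'W X Y' => 'WbbbXbbbY',
    'X Y'   => 'XbbbY',
    'X Y Z' => 'XbbbYbbbZ',
);

# Hypothetical helper: apply each key's substitution in the order given.
# Assumes the keys are handled sequentially, one s///g per key.
sub apply_in_order {
    my ($target, @keys) = @_;
    $target =~ s/\Q$_\E/$seq{$_}/g for @keys;
    return $target;
}

# Longest keys first; ties broken lexically.
my @sorted = sort { length $b <=> length $a || $a cmp $b } keys %seq;

print join(', ', @sorted), "\n";
print apply_in_order('W X Y Z', @sorted), "\n";

# A different order (as unsorted hash keys might give) changes the result.
print apply_in_order('W X Y Z', 'X Y', 'W X Y', 'X Y Z'), "\n";
```

With the sort in place the order of the keys, and therefore the result, no longer depends on the hash's internal ordering.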

— Ken


In reply to Re^3: script optmization by kcott
in thread script optmization by shoura
